2025-05-09-12-03

Enigme: Generative Text Puzzles for Evaluating Reasoning in Language Models

Abstract

arXiv:2505.04914v1 Announce Type: new Abstract: Transformer-decoder language models are a core innovation in text-based generative artificial intelligence. These models are being deployed as general-purpose intelligence systems in many applications. Central to their utility is the capacity to understand natural language commands and exploit the reasoning embedded in human text corpora to apply some form of reasoning process to a wide variety of novel tasks. To understand the limitations of this approach to generating reasoning, we argue that we need to consider the architectural constraints of these systems. Consideration of the latent variable structure of transformer-decoder models allows us to design reasoning tasks that should probe the boundary of their capacity to reason. We present enigme, an open-source library for generating text-based puzzles to be used in training and evaluating reasoning skills within transformer-decoder models and future AI architectures.

摘要

基于Transformer解码器的语言模型是文本生成人工智能的核心创新技术。这些模型正作为通用智能系统被部署于众多应用场景。其功能的核心在于理解自然语言指令的能力，以及利用人类文本语料库中蕴含的推理机制，将某种形式的推理过程应用于各类新颖任务。为理解这种推理生成方法的局限性，我们认为需要考察这些系统的架构约束。通过分析Transformer解码器模型的潜在变量结构，我们得以设计出能够探测其推理能力边界的测试任务。本文提出Enigme——一个开源的文本谜题生成库，用于训练和评估Transformer解码器模型及未来AI架构的推理能力。

Position: Epistemic Artificial Intelligence is Essential for Machine Learning Models to Know When They Do Not Know

Abstract

arXiv:2505.04950v1 Announce Type: new Abstract: Despite the impressive achievements of AI, including advancements in generative models and large language models, there remains a significant gap in the ability of AI to handle uncertainty and generalize beyond the training data. We argue that AI models, especially in autonomous systems, fail to make robust predictions when faced with unfamiliar or adversarial data, as evidenced by incidents with autonomous vehicles. Traditional machine learning approaches struggle to address these issues due to an overemphasis on data fitting and domain adaptation. This position paper posits a paradigm shift towards epistemic artificial intelligence, emphasizing the need for models to learn not only from what they know but also from their ignorance. This approach, which focuses on recognizing and managing uncertainty, offers a potential solution to improve the resilience and robustness of AI systems, ensuring that they can better handle unpredictable real-world environments.

摘要

尽管人工智能已取得令人瞩目的成就，包括生成模型和大语言模型的进步，但其在处理不确定性和训练数据外泛化能力方面仍存在显著不足。我们认为，人工智能模型（尤其是自主系统中的模型）在面对陌生或对抗性数据时无法做出稳健预测，自动驾驶汽车的相关事故便佐证了这一点。传统机器学习方法因过度强调数据拟合和领域适应而难以解决这些问题。本立场论文提出向认知人工智能的范式转变，强调模型不仅需要从已知知识中学习，更需从未知中学习。这种以识别和管理不确定性为核心的方法，为提升人工智能系统的韧性和鲁棒性提供了潜在解决方案，从而确保其能更好地应对不可预测的现实环境。

Towards Artificial Intelligence Research Assistant for Expert-Involved Learning

Abstract

arXiv:2505.04638v1 Announce Type: new Abstract: Large Language Models (LLMs) and Large Multi-Modal Models (LMMs) have emerged as transformative tools in scientific research, yet their reliability and specific contributions to biomedical applications remain insufficiently characterized. In this study, we present ARtificial Intelligence research assistant for Expert-involved Learning (ARIEL), a multimodal dataset designed to benchmark and enhance two critical capabilities of LLMs and LMMs in biomedical research: summarizing extensive scientific texts and interpreting complex biomedical figures. To facilitate rigorous assessment, we create two open-source sets comprising biomedical articles and figures with designed questions. We systematically benchmark both open- and closed-source foundation models, incorporating expert-driven human evaluations conducted by doctoral-level experts. Furthermore, we improve model performance through targeted prompt engineering and fine-tuning strategies for summarizing research papers, and apply test-time computational scaling to enhance the reasoning capabilities of LMMs, achieving superior accuracy compared to human-expert corrections. We also explore the potential of using LMM Agents to generate scientific hypotheses from diverse multimodal inputs. Overall, our results delineate clear strengths and highlight significant limitations of current foundation models, providing actionable insights and guiding future advancements in deploying large-scale language and multi-modal models within biomedical research.

摘要

大语言模型（LLMs）与大模态模型（LMMs）已成为科学研究的变革性工具，但其在生物医学应用中的可靠性和具体贡献仍缺乏充分表征。本研究提出ARIEL（专家参与学习的人工智能研究助手），这是一个多模态数据集，旨在评估并增强LLMs与LMMs在生物医学研究中的两项关键能力：总结长篇科学文本和解析复杂生物医学图表。为支持严谨评估，我们创建了两套开源数据集，包含生物医学文献与图表及其配套问题。我们系统性地对开源与闭源基础模型进行基准测试，并引入博士级专家主导的人工评估。此外，通过针对性提示工程与微调策略提升研究论文摘要任务的模型性能，并应用测试时计算扩展增强LMMs的推理能力，其准确率已超越人类专家修正结果。我们还探索了利用LMM智能体从多模态输入生成科学假设的潜力。总体而言，研究结果明确了当前基础模型的优势，同时揭示了显著局限，为生物医学研究中大规模语言与多模态模型的部署提供了可行见解与发展方向。

Large Language Models are Autonomous Cyber Defenders

Abstract

arXiv:2505.04843v1 Announce Type: new Abstract: Fast and effective incident response is essential to prevent adversarial cyberattacks. Autonomous Cyber Defense (ACD) aims to automate incident response through Artificial Intelligence (AI) agents that plan and execute actions. Most ACD approaches focus on single-agent scenarios and leverage Reinforcement Learning (RL). However, ACD RL-trained agents depend on costly training, and their reasoning is not always explainable or transferable. Large Language Models (LLMs) can address these concerns by providing explainable actions in general security contexts. Researchers have explored LLM agents for ACD but have not evaluated them on multi-agent scenarios or interacting with other ACD agents. In this paper, we show the first study on how LLMs perform in multi-agent ACD environments by proposing a new integration to the CybORG CAGE 4 environment. We examine how ACD teams of LLM and RL agents can interact by proposing a novel communication protocol. Our results highlight the strengths and weaknesses of LLMs and RL and help us identify promising research directions to create, train, and deploy future teams of ACD agents.

摘要

快速有效的应急响应对于防范恶意网络攻击至关重要。自主网络防御（ACD）旨在通过规划与执行行动的人工智能（AI）代理实现响应自动化。现有ACD方法多聚焦于单代理场景并采用强化学习（RL），但RL训练的ACD代理存在训练成本高昂、决策过程缺乏可解释性及可迁移性等局限。大型语言模型（LLMs）能通过提供通用安全场景下的可解释行动来应对这些问题。尽管已有研究探索LLM代理在ACD中的应用，但尚未评估其在多代理场景或与其他ACD代理交互时的表现。本文通过提出CybORG CAGE 4环境的新集成方案，首次研究了LLM在多代理ACD环境中的性能表现。我们设计新型通信协议，考察LLM与RL代理组成的ACD团队如何协作。实验结果揭示了LLM与RL的优势与不足，为未来ACD代理团队的创建、训练和部署指明了研究方向。

Exploring Influence Factors on LLM Suitability for No-Code Development of End User IoT Applications

Abstract

arXiv:2505.04710v1 Announce Type: new Abstract: With the increasing popularity of IoT applications, end users demand more personalized and intuitive functionality. A major obstacle to this, however, is that custom IoT functionality today still requires at least some coding skills. To address this, no-code development platforms have been proposed as a solution for empowering non-technical users to create applications. However, such platforms still require a certain level of technical expertise for structuring process steps or defining event-action relations. The advent of LLMs can further enhance no-code platforms by enabling natural language-based interaction, automation of complex tasks, and dynamic code generation. By allowing users to describe their requirements in natural language, LLMs can significantly streamline no-code development. As LLMs vary in performance, architecture, training data used, and the use cases they target, it is still unclear which models are best suited and which factors determine this fit. In particular, no-code development of IoT applications by non-technical users places completely different demands on LLMs than, e.g., code generation for more open-ended applications or for supporting professional developers. In this paper, we explore the factors influencing the suitability of LLMs for no-code development of IoT applications. We also examine the role of input prompt language in the accuracy and quality of generated applications, as well as the influence of LLM training data. By conducting comprehensive experiments with a range of LLMs, we provide valuable insights for optimizing LLM-powered no-code platforms, guiding the selection of suitable LLMs and their effective application. Our findings contribute to improving the accessibility, efficiency, and user experience of no-code IoT development, ultimately enabling broader adoption of IoT technologies among non-expert users.

摘要

随着物联网应用的日益普及，终端用户对个性化和直观功能的需求不断增长。然而当前定制化物联网功能仍需至少具备一定编程能力，这成为主要障碍。为解决该问题，无代码开发平台被提出作为赋能非技术用户创建应用的解决方案。但此类平台在构建流程步骤或定义事件-动作关系时仍需要一定技术专长。大型语言模型（LLM）的出现通过实现基于自然语言的交互、复杂任务自动化及动态代码生成，可进一步提升无代码平台能力。当用户能够以自然语言描述需求时，LLM可显著简化无代码开发流程。由于LLM在性能、架构、训练数据及应用场景方面存在差异，目前尚不清楚哪些模型最适合以及决定适配性的影响因素。特别是非技术用户进行物联网应用的无代码开发对LLM的要求，与开放式应用的代码生成或专业开发者辅助等场景存在本质区别。本文探究了影响LLM适用于物联网无代码开发的关键因素，研究了输入提示语言对生成应用准确性和质量的作用，以及LLM训练数据的影响。通过针对多种LLM开展综合实验，我们为优化基于LLM的无代码平台提供了重要见解，指导合适LLM的选择及其有效应用。本研究有助于提升无代码物联网开发的易用性、效率和用户体验，最终促进非专业用户更广泛地采用物联网技术。

Text2Cypher: Data Pruning using Hard Example Selection

Abstract

arXiv:2505.05122v1 Announce Type: new Abstract: Database query languages such as SQL for relational databases and Cypher for graph databases have been widely adopted. Recent advancements in large language models (LLMs) enable natural language interactions with databases through models like Text2SQL and Text2Cypher. Fine-tuning these models typically requires large, diverse datasets containing non-trivial examples. However, as dataset size increases, the cost of fine-tuning also rises. This makes smaller, high-quality datasets essential for reducing costs for the same or better performance. In this paper, we propose five hard-example selection techniques for pruning the Text2Cypher dataset, aiming to preserve or improve performance while reducing resource usage. Our results show that these hard-example selection approaches can halve training time and costs with minimal impact on performance, and demonstrate that hard-example selection provides a cost-effective solution.

摘要

关系型数据库的SQL和图数据库的Cypher等查询语言已被广泛采用。大型语言模型（LLMs）的最新进展使得通过Text2SQL和Text2Cypher等模型实现与数据库的自然语言交互成为可能。微调这些模型通常需要包含非平凡示例的大规模多样化数据集。然而，随着数据集规模增大，微调成本也随之上升。这使得在保持或提升性能的同时，小型高质量数据集对于降低成本至关重要。本文提出五种困难样本选择技术用于修剪Text2Cypher数据集，旨在减少资源使用的同时保持或提升性能。实验结果表明，这些困难样本选择方法可将训练时间和成本减半且对性能影响极小，证明困难样本选择是一种高性价比的解决方案。
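
Hard-example selection can be pictured as ranking training pairs by a difficulty score and keeping only the top fraction. The sketch below is illustrative only: the abstract does not name the five techniques, so the scoring heuristic (counting Cypher clause keywords), the `keep_fraction` parameter, and the function names `difficulty` and `prune` are all assumptions.

```python
import re

# Hypothetical difficulty proxy (not from the paper): count Cypher clause keywords.
CLAUSES = re.compile(r"\b(MATCH|WHERE|WITH|RETURN|UNWIND)\b", re.IGNORECASE)


def difficulty(cypher: str) -> int:
    """Return the number of clause keywords found in a Cypher query."""
    return len(CLAUSES.findall(cypher))


def prune(examples, keep_fraction=0.5):
    """Keep the hardest `keep_fraction` of examples, hardest first.

    Each example is a dict with at least a "cypher" key.
    """
    ranked = sorted(examples, key=lambda ex: difficulty(ex["cypher"]), reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]
```

With `keep_fraction=0.5` the fine-tuning set is halved, which is the scale of reduction the abstract reports; whether performance holds depends on the scoring rule actually used.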

The Promise and Limits of LLMs in Constructing Proofs and Hints for Logic Problems in Intelligent Tutoring Systems

Abstract

arXiv:2505.04736v1 Announce Type: new Abstract: Intelligent tutoring systems have demonstrated effectiveness in teaching formal propositional logic proofs, but their reliance on template-based explanations limits their ability to provide personalized student feedback. While large language models (LLMs) offer promising capabilities for dynamic feedback generation, they risk producing hallucinations or pedagogically unsound explanations. We evaluated the stepwise accuracy of LLMs in constructing multi-step symbolic logic proofs, comparing six prompting techniques across four state-of-the-art LLMs on 358 propositional logic problems. Results show that DeepSeek-V3 achieved superior performance with 84.4% accuracy on stepwise proof construction and excelled particularly in simpler rules. We further used the best-performing LLM to generate explanatory hints for 1,050 unique student problem-solving states from a logic ITS and evaluated them on 4 criteria with both an LLM grader and human expert ratings on a 20% sample. Our analysis finds that LLM-generated hints were 75% accurate and rated highly by human evaluators on consistency and clarity, but did not perform as well at explaining why the hint was provided or its larger context. Our results demonstrate that LLMs may be used to augment tutoring systems with logic tutoring hints, but require additional modifications to ensure accuracy and pedagogical appropriateness.

摘要

智能辅导系统在教授形式命题逻辑证明方面已显示出有效性，但其基于模板的解释方式限制了提供个性化学生反馈的能力。虽然大型语言模型（LLMs）为动态反馈生成提供了有前景的能力，但它们存在产生幻觉或教学上不合理的解释的风险。我们评估了LLMs在构建多步符号逻辑证明中的逐步准确性，在358个命题逻辑问题上比较了四种最先进LLMs的六种提示技术。结果显示，DeepSeek-V3在逐步证明构建中以84.4%的准确率表现出色，尤其在简单规则上表现优异。我们进一步使用性能最佳的LLM为逻辑智能辅导系统中的1,050个独特学生问题解决状态生成解释性提示，并通过LLM评分器和人类专家对20%样本的4项标准进行评估。分析发现，LLM生成的提示准确率为75%，在一致性和清晰度方面获得人类评估者的高度评价，但在解释提示的提供原因及其更大背景方面表现不佳。我们的结果表明，LLMs可用于为辅导系统增强逻辑辅导提示，但需要进一步修改以确保准确性和教学适宜性。

Enhancing Text2Cypher with Schema Filtering

Abstract

arXiv:2505.05118v1 Announce Type: new Abstract: Knowledge graphs represent complex data using nodes, relationships, and properties. Cypher, a powerful query language for graph databases, enables efficient modeling and querying. Recent advancements in large language models allow translation of natural language questions into Cypher queries - Text2Cypher. A common approach is incorporating database schema into prompts. However, complex schemas can introduce noise, increase hallucinations, and raise computational costs. Schema filtering addresses these challenges by including only relevant schema elements, improving query generation while reducing token costs. This work explores various schema filtering methods for the Text2Cypher task and analyzes their impact on token length, performance, and cost. Results show that schema filtering effectively optimizes Text2Cypher, especially for smaller models. Consistent with prior research, we find that larger models benefit less from schema filtering due to their longer context capabilities. However, schema filtering remains valuable for both larger and smaller models in cost reduction.

摘要

知识图谱通过节点、关系和属性来表征复杂数据。Cypher作为一种强大的图数据库查询语言，能够实现高效的数据建模与查询。随着大语言模型的发展，自然语言问题到Cypher查询的转换（Text2Cypher）成为可能。当前主流方法是将数据库模式整合至提示词中，但复杂模式可能引入噪声、加剧幻觉现象并增加计算成本。模式过滤技术通过仅保留相关模式元素来解决这些问题，在提升查询生成质量的同时降低标记开销。本研究系统探讨了Text2Cypher任务中不同模式过滤方法，并分析了其对标记长度、性能及成本的影响。实验结果表明，模式过滤能有效优化Text2Cypher任务，尤其对小型模型效果显著。与既有研究一致，我们发现大型模型因其长上下文处理能力从模式过滤中获益较少。但值得注意的是，模式过滤在降低各类模型使用成本方面仍具有重要价值。
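
As a rough illustration of schema filtering, the sketch below keeps only those schema elements whose name or property names literally appear in the question, so that a shorter schema goes into the prompt. This is a naive exact-match heuristic chosen for brevity, not one of the methods evaluated in the paper; the schema layout (a dict from element name to property list) and the function name `filter_schema` are assumptions.

```python
import re


def filter_schema(question: str, schema: dict) -> dict:
    """Return the subset of `schema` that the question appears to mention.

    `schema` maps a node label or relationship type to a list of property names.
    Matching is a case-insensitive exact word match (a deliberate simplification).
    """
    words = set(re.findall(r"[a-z]+", question.lower()))
    kept = {}
    for element, props in schema.items():
        names = {element.lower(), *(p.lower() for p in props)}
        if names & words:  # the question mentions the element or one of its properties
            kept[element] = props
    return kept
```

For the question "Which movie has the title Inception?" and a schema with `Movie`, `Person`, and `Studio` entries, only `Movie` survives, which shrinks the prompt. A practical filter would also need stemming or embedding similarity to catch paraphrases.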

ChemRxivQuest: A Curated Chemistry Question-Answer Database Extracted from ChemRxiv Preprints

Abstract

arXiv:2505.05232v1 Announce Type: new Abstract: The rapid expansion of chemistry literature poses significant challenges for researchers seeking to efficiently access domain-specific knowledge. To support advancements in chemistry-focused natural language processing (NLP), we present ChemRxivQuest, a curated dataset of 970 high-quality question-answer (QA) pairs derived from 155 ChemRxiv preprints across 17 subfields of chemistry. Each QA pair is explicitly linked to its source text segment to ensure traceability and contextual accuracy. ChemRxivQuest was constructed using an automated pipeline that combines optical character recognition (OCR), GPT-4o-based QA generation, and a fuzzy matching technique for answer verification. The dataset emphasizes conceptual, mechanistic, applied, and experimental questions, enabling applications in retrieval-based QA systems, search engine development, and fine-tuning of domain-adapted large language models. We analyze the dataset's structure, coverage, and limitations, and outline future directions for expansion and expert validation. ChemRxivQuest provides a foundational resource for chemistry NLP research, education, and tool development.

摘要

化学文献的快速扩张对研究人员高效获取领域特定知识提出了重大挑战。为支持化学领域自然语言处理（NLP）的发展，我们推出ChemRxivQuest——一个从17个化学子学科的155篇ChemRxiv预印本中提取的970组高质量问答对（QA）的精选数据集。每个问答对均明确关联至源文本片段，确保可追溯性和上下文准确性。该数据集通过结合光学字符识别（OCR）、基于GPT-4o的问答生成及模糊匹配答案验证技术的自动化流程构建而成，重点关注概念性、机理性、应用性和实验性问题，可应用于检索式问答系统、搜索引擎开发及领域适配大语言模型的微调。我们分析了数据集的结构、覆盖范围和局限性，并规划了未来扩展与专家验证的方向。ChemRxivQuest为化学NLP研究、教育及工具开发提供了基础性资源。
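
The answer-verification step can be sketched with approximate string matching from the Python standard library. The abstract only says "a fuzzy matching technique", so the sliding-window approach, the use of `difflib.SequenceMatcher`, the 0.8 threshold, and the function names below are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def best_match_ratio(answer: str, source: str) -> float:
    """Best similarity between `answer` and any same-length word window of `source`."""
    a_words = answer.lower().split()
    s_words = source.lower().split()
    n = len(a_words)
    if n == 0 or not s_words:
        return 0.0
    target = " ".join(a_words)
    best = 0.0
    # Slide a window of n words across the source segment.
    for i in range(max(1, len(s_words) - n + 1)):
        window = " ".join(s_words[i:i + n])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return best


def is_supported(answer: str, source: str, threshold: float = 0.8) -> bool:
    """Treat an answer as verified if some window of the source is close enough."""
    return best_match_ratio(answer, source) >= threshold
```

A check like this ties each generated answer back to its source segment, in line with the traceability goal the abstract describes; the threshold trades missed paraphrases against false acceptances.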

Abstract

arXiv:2505.05177v1 Announce Type: new Abstract: Large Language Models (LLMs) assist in specialized tasks but struggle to align with evolving domain knowledge without costly fine-tuning. Domain knowledge consists of: Knowledge: Immutable facts (e.g., 'A stone is solid') and generally accepted principles (e.g., ethical standards); Refined Memory: Evolving insights shaped by business needs and real-world changes. However, a significant gap often exists between a domain expert's deep, nuanced understanding and the system's domain knowledge, which can hinder accurate information retrieval and application. Our Memory-Augmented Refinement of Knowledge (MARK) framework enables LLMs to continuously learn without retraining by leveraging structured refined memory, inspired by the Society of Mind. MARK operates through specialized agents, each serving a distinct role: Residual Refined Memory Agent: Stores and retrieves domain-specific insights to maintain context over time; User Question Refined Memory Agent: Captures user-provided facts, abbreviations, and terminology for better comprehension; LLM Response Refined Memory Agent: Extracts key elements from responses for refinement and personalization. These agents analyse stored refined memory, detect patterns, resolve contradictions, and improve response accuracy. Temporal factors like recency and frequency prioritize relevant information while discarding outdated insights. MARK enhances LLMs in multiple ways: Ground Truth Strategy: Reduces hallucinations by establishing a structured reference; Domain-Specific Adaptation: Essential for fields like healthcare, law, and manufacturing, where proprietary insights are absent from public datasets; Personalized AI Assistants: Improves virtual assistants by remembering user preferences, ensuring coherent responses over time.

摘要

大语言模型（LLMs）能够辅助专业任务，但在不进行昂贵微调的情况下难以适应不断演进的领域知识。领域知识包含两方面：知识：不可变事实（如"石头是固体"）和普遍接受的原则（如伦理标准）；精炼记忆：由业务需求和现实变化塑造的演进见解。然而，领域专家的深刻、细致理解与系统领域知识之间常存在显著差距，这可能阻碍准确的信息检索和应用。受"心智社会"启发，我们提出的知识精炼记忆增强框架（MARK）使LLMs无需重新训练即可持续学习。MARK通过专业代理运作，每个代理承担特定职能：残余精炼记忆代理：存储并检索领域特定见解以维持长期上下文；用户问题精炼记忆代理：捕获用户提供的事实、缩写和术语以提升理解；LLM响应精炼记忆代理：从响应中提取关键要素进行精炼和个性化。这些代理分析存储的精炼记忆，检测模式，解决矛盾并提高响应准确性。通过时效性和频率等时间因素对相关信息进行优先级排序，同时淘汰过时见解。MARK从多维度增强LLMs：基准事实策略：通过建立结构化参照减少幻觉；领域特定适配：对医疗、法律和制造等缺乏公开数据专有见解的领域至关重要；个性化AI助手：通过记忆用户偏好改进虚拟助手，确保长期响应连贯性。
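
The role of recency and frequency in prioritizing refined memory can be sketched as a simple scoring rule. The abstract does not give MARK's actual formula, so the exponential half-life decay, the logarithmic frequency term, the `min_score` cutoff, and all names below are assumptions.

```python
import math


def memory_score(age_days: float, hits: int, half_life_days: float = 30.0) -> float:
    """Score a memory entry: recency decays exponentially, frequency grows logarithmically."""
    recency = 0.5 ** (age_days / half_life_days)
    frequency = math.log1p(hits)
    return recency * frequency


def prioritize(memories, min_score: float = 0.1):
    """Drop entries scoring below `min_score` and return the rest, best first.

    Each memory is a dict with "age_days" and "hits" keys.
    """
    scored = [(memory_score(m["age_days"], m["hits"]), m) for m in memories]
    kept = [(s, m) for s, m in scored if s >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in kept]
```

Under these assumed settings, an insight referenced five times today outranks the same insight from a month ago, and an entry seen once a year ago falls below the cutoff and is discarded.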

Multi-agent Embodied AI: Advances and Future Directions

Abstract

arXiv:2505.05108v1 Announce Type: new Abstract: Embodied artificial intelligence (Embodied AI) plays a pivotal role in the application of advanced technologies in the intelligent era, where AI systems are integrated with physical bodies that enable them to perceive, reason, and interact with their environments. Through the use of sensors for input and actuators for action, these systems can learn and adapt based on real-world feedback, allowing them to perform tasks effectively in dynamic and unpredictable environments. As techniques such as deep learning (DL), reinforcement learning (RL), and large language models (LLMs) mature, embodied AI has become a leading field in both academia and industry, with applications spanning robotics, healthcare, transportation, and manufacturing. However, most research has focused on single-agent systems that often assume static, closed environments, whereas real-world embodied AI must navigate far more complex scenarios. In such settings, agents must not only interact with their surroundings but also collaborate with other agents, necessitating sophisticated mechanisms for adaptation, real-time learning, and collaborative problem-solving. Despite increasing interest in multi-agent systems, existing research remains narrow in scope, often relying on simplified models that fail to capture the full complexity of dynamic, open environments for multi-agent embodied AI. Moreover, no comprehensive survey has systematically reviewed the advancements in this area. As embodied AI rapidly evolves, it is crucial to deepen our understanding of multi-agent embodied AI to address the challenges presented by real-world applications. To fill this gap and foster further development in the field, this paper reviews the current state of research, analyzes key contributions, and identifies challenges and future directions, providing insights to guide innovation and progress in this field.

摘要

具身人工智能（Embodied AI）在智能时代先进技术应用中发挥着关键作用，其通过将AI系统与物理载体结合，使系统能够感知、推理并与环境交互。这些系统利用传感器获取输入，通过执行器采取行动，并基于现实世界反馈进行学习与适应，从而在动态不可预测的环境中高效执行任务。随着深度学习（DL）、强化学习（RL）和大语言模型（LLM）等技术的成熟，具身AI已成为学界与工业界的前沿领域，应用涵盖机器人、医疗、交通和制造业。然而，现有研究多集中于假设静态封闭环境的单智能体系统，而现实世界的具身AI需应对更复杂的场景。在此类场景中，智能体不仅需与环境交互，还需与其他智能体协作，这就要求其具备适应机制、实时学习及协同问题解决等高级能力。尽管多智能体系统研究日益受到关注，现有工作仍局限于简化模型，未能充分捕捉动态开放环境中多智能体具身AI的完整复杂性。此外，目前尚无系统性综述全面梳理该领域的进展。随着具身AI的快速发展，深入理解多智能体具身AI对应对实际应用挑战至关重要。为填补这一空白并推动领域发展，本文回顾了当前研究现状，分析了关键贡献，指出挑战与未来方向，旨在为该领域的创新与进步提供指导性见解。

CacheFL: Efficient Federated Cache Model Fine-Tuning for Vision-Language Models

Abstract

arXiv:2505.05130v1 Announce Type: new Abstract: Large pre-trained Vision-Language Models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), have exhibited remarkable zero-shot performance across various image classification tasks. Fine-tuning these models on domain-specific datasets further enhances their effectiveness for downstream applications. However, fine-tuning in cloud environments raises significant concerns regarding data security and privacy. Federated Learning (FL) offers a decentralized solution by enabling model training across local clients without centralizing sensitive data, but the high communication and computation costs of transmitting full pre-trained models during training limit its scalability. Additionally, non-Independent and Identically Distributed (non-IID) data across local clients can negatively impact model convergence and performance. To address these challenges, we propose CacheFL, a novel federated learning method that replaces traditional full model fine-tuning with lightweight cache model fine-tuning. The cache model is initialized using a class-balanced dataset generated by a generative pre-trained model, effectively mitigating the impact of non-IID data. This cache model is then distributed to local clients for fine-tuning, and the updated parameters from each client are aggregated on the server and redistributed. With the updated cache model, the classification performance of CLIP is improved after just a few epochs. By limiting the training and communication to the cache model, CacheFL significantly reduces resource demands while ensuring data privacy and security. Extensive experiments conducted on ImageNet and 10 additional datasets demonstrate that CacheFL outperforms traditional approaches in terms of classification accuracy, resource efficiency, and privacy preservation.

摘要

大规模预训练视觉语言模型（VLMs），例如对比语言-图像预训练（CLIP），在各种图像分类任务中展现出卓越的零样本性能。在特定领域数据集上对这些模型进行微调，可进一步提升其在下游应用中的有效性。然而，云端环境中的微调引发了数据安全与隐私方面的重大隐忧。联邦学习（FL）通过允许模型在本地客户端上进行训练而无需集中敏感数据，提供了一种去中心化解决方案，但训练期间传输完整预训练模型的高通信与计算成本限制了其可扩展性。此外，本地客户端间的非独立同分布（non-IID）数据可能对模型收敛与性能产生负面影响。为解决这些挑战，我们提出CacheFL——一种新颖的联邦学习方法，以轻量级缓存模型微调替代传统的完整模型微调。该缓存模型通过生成式预训练模型生成的类别平衡数据集初始化，有效缓解非IID数据的影响。随后将该缓存模型分发至本地客户端进行微调，并将各客户端的更新参数在服务器端聚合后重新分发。借助更新的缓存模型，CLIP的分类性能仅需少量训练周期即可提升。通过将训练与通信限制在缓存模型内，CacheFL在确保数据隐私与安全的同时显著降低了资源需求。在ImageNet及另外10个数据集上的大量实验表明，CacheFL在分类准确率、资源效率与隐私保护方面均优于传统方法。
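
The server-side aggregation step can be illustrated with sample-weighted parameter averaging in the style of FedAvg. The abstract says only that client updates are "aggregated on the server", so the weighting rule and the flat list-of-floats parameter format here are assumptions, and `aggregate` is a hypothetical name.

```python
def aggregate(client_updates):
    """Sample-weighted average of client parameter vectors.

    `client_updates` is a list of (num_samples, params) pairs, where `params`
    is a list of floats of equal length across clients.
    """
    total = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    return [
        sum(n * params[i] for n, params in client_updates) / total
        for i in range(dim)
    ]
```

Because only the lightweight cache model's parameters pass through a step like this, the per-round payload stays small compared with sending a full pre-trained model.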

EcoAgent: An Efficient Edge-Cloud Collaborative Multi-Agent Framework for Mobile Automation

Abstract

arXiv:2505.05440v1 Announce Type: new Abstract: Cloud-based mobile agents powered by (multimodal) large language models ((M)LLMs) offer strong reasoning abilities but suffer from high latency and cost. While fine-tuned (M)SLMs enable edge deployment, they often lose general capabilities and struggle with complex tasks. To address this, we propose EcoAgent, an Edge-Cloud cOllaborative multi-agent framework for mobile automation. EcoAgent features a closed-loop collaboration among a cloud-based Planning Agent and two edge-based agents: the Execution Agent for action execution and the Observation Agent for verifying outcomes. The Observation Agent uses a Pre-Understanding Module to compress screen images into concise text, reducing token usage. In case of failure, the Planning Agent retrieves screen history and replans via a Reflection Module. Experiments on AndroidWorld show that EcoAgent maintains high task success rates while significantly reducing MLLM token consumption, enabling efficient and practical mobile automation.

摘要

基于云端、由（多模态）大语言模型（(M)LLMs）驱动的移动智能体虽具备强大的推理能力，但存在高延迟和高成本问题。虽然经过微调的(M)SLMs可实现边缘部署，但通常会丧失通用能力且难以处理复杂任务。为此，我们提出EcoAgent——一种面向移动自动化的边缘-云端协同多智能体框架。该框架通过云端规划智能体与两个边缘智能体（执行智能体负责动作执行，观察智能体负责结果验证）形成闭环协作。观察智能体采用预理解模块将屏幕图像压缩为简洁文本，显著降低token消耗。当任务失败时，规划智能体通过反射模块检索屏幕历史并重新规划。在AndroidWorld上的实验表明，EcoAgent在保持高任务成功率的同时，大幅减少了大语言模型的token消耗，实现了高效实用的移动自动化。

HEXGEN-TEXT2SQL: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL Workflow

Abstract

arXiv:2505.05286v1 Announce Type: new Abstract: Recent advances in leveraging the agentic paradigm of large language model (LLM) utilization have significantly enhanced Text-to-SQL capabilities, enabling users without specialized database expertise to query data intuitively. However, deploying these agentic LLM-based Text-to-SQL systems in production poses substantial challenges due to their inherently multi-stage workflows, stringent latency constraints, and potentially heterogeneous GPU infrastructure in enterprise environments. Current LLM serving frameworks lack effective mechanisms for handling interdependent inference tasks, dynamic latency variability, and resource heterogeneity, leading to suboptimal performance and frequent service-level objective (SLO) violations. In this paper, we introduce HEXGEN-TEXT2SQL, a novel framework designed explicitly to schedule and execute agentic multi-stage LLM-based Text-to-SQL workflows on heterogeneous GPU clusters that handle multi-tenant end-to-end queries. HEXGEN-TEXT2SQL introduces a hierarchical scheduling approach combining global workload-balanced task dispatching and local adaptive urgency-guided prioritization, guided by a systematic analysis of agentic Text-to-SQL workflows. Additionally, we propose a lightweight simulation-based method for tuning critical scheduling hyperparameters, further enhancing robustness and adaptability. Our extensive evaluation on realistic Text-to-SQL benchmarks demonstrates that HEXGEN-TEXT2SQL significantly outperforms state-of-the-art LLM serving frameworks. Specifically, HEXGEN-TEXT2SQL reduces latency deadlines by up to 1.67× (average: 1.41×) and improves system throughput by up to 1.75× (average: 1.65×) compared to vLLM under diverse, realistic workload conditions. Our code is available at https://github.com/Relaxed-System-Lab/Hexgen-Flow.

摘要

近年来，基于大语言模型（LLMs）智能体范式应用的重大进展显著提升了文本到SQL（Text-to-SQL）的能力，使得不具备专业数据库知识的用户能够直观地进行数据查询。然而，由于这类基于智能体LLM的文本到SQL系统本质上具有多阶段工作流程、严格的延迟约束以及企业环境中潜在的异构GPU基础设施，将其部署到生产环境面临巨大挑战。当前LLM服务框架缺乏有效机制来处理相互依赖的推理任务、动态延迟变化和资源异构性，导致性能欠佳和频繁违反服务级别目标（SLO）。本文提出HEXGEN-TEXT2SQL，这是一个专为在异构GPU集群上调度和执行基于智能体多阶段LLM的文本到SQL工作流而设计的新框架，可处理多租户端到端查询。HEXGEN-TEXT2SQL引入了一种分层调度方法，结合全局负载均衡的任务分发和局部自适应紧急度引导的优先级排序，该方法基于对智能体文本到SQL工作流的系统分析。此外，我们提出了一种基于轻量级模拟的关键调度超参数调优方法，进一步增强了系统的鲁棒性和适应性。在真实文本到SQL基准测试中的广泛评估表明，HEXGEN-TEXT2SQL显著优于最先进的LLM服务框架。具体而言，与vLLM相比，HEXGEN-TEXT2SQL在不同真实工作负载条件下将延迟截止时间缩短了最高1.67倍（平均1.41倍），并将系统吞吐量提高了最高1.75倍（平均1.65倍）。我们的代码可在https://github.com/Relaxed-System-Lab/Hexgen-Flow获取。
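
Urgency-guided prioritization can be sketched with a least-slack-first queue: the request with the least spare time before its deadline runs first. This is a standard scheduling heuristic used here as a stand-in, not the paper's actual priority function; the task fields (`deadline`, `est_remaining`) and function names are assumptions.

```python
import heapq


def slack(task, now: float) -> float:
    """Spare time left if the task started now: deadline minus now minus estimated work."""
    return task["deadline"] - now - task["est_remaining"]


def schedule(tasks, now: float = 0.0):
    """Return task ids ordered from most to least urgent (smallest slack first)."""
    # The index breaks ties so the heap never has to compare ids of equal slack.
    heap = [(slack(t, now), i, t["id"]) for i, t in enumerate(tasks)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, task_id = heapq.heappop(heap)
        order.append(task_id)
    return order
```

In a multi-stage workflow the estimated remaining work would cover all later stages of a query, so a query close to its deadline is served before one with more spare time.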

Abstract

arXiv:2505.04628v1 Announce Type: cross Abstract: Expanding the application of large language models (LLMs) to societal life, beyond their primary function as auxiliary assistants that communicate with one person at a time, requires that LLMs be able to independently play roles in multi-user, multi-turn social agent tasks within complex social settings. However, this capability has not yet been systematically measured with available benchmarks. To address this gap, we first introduce an agent task leveling framework grounded in sociological principles. We also propose a novel benchmark, How Social Is It (HSII below), designed to assess LLMs' social capabilities in comprehensive social agent tasks and to benchmark representative models. HSII comprises four stages: format parsing, target selection, target switching conversation, and stable conversation, which collectively evaluate the communication and task completion capabilities of LLMs on a dataset of realistic social interaction scenarios, HSII-Dataset. The dataset is derived step by step from a news dataset. We perform an ablation study by clustering the dataset. Additionally, we investigate the impact of the chain of thought (COT) method on LLMs' social performance. Since COT costs more computation, we further introduce a new statistical metric, COT-complexity, to quantify the efficiency of certain LLMs with COT on specific social tasks and to strike a better trade-off between measuring correctness and efficiency. Our experimental results demonstrate that the benchmark is well suited for evaluating social skills in LLMs.

摘要

扩大大型语言模型（LLMs）在社会生活中的应用，而不仅限于作为与单一用户交互的辅助工具，需要LLMs具备在复杂社会情境中独立承担多用户、多轮次社交代理任务的能力。然而，当前尚缺乏系统性评估该能力的基准测试。为此，我们首先基于社会学原理提出了一个代理任务分级框架，同时设计了一个名为“How Social Is It”（简称HSII）的新型基准测试，用于全面评估LLMs在社交代理任务中的社会能力并对代表性模型进行基准测试。HSII包含四个阶段：格式解析、目标选择、目标切换对话和稳定对话，通过源自新闻数据集逐步构建的真实社交互动场景数据集HSII-Dataset，综合评估LLMs的沟通与任务完成能力。我们通过对数据集进行聚类分析开展了消融实验，并研究了思维链（COT）方法对提升LLMs社交表现的影响。鉴于COT会消耗更多计算资源，我们进一步提出了新的统计指标COT复杂度，用以量化特定LLMs在完成特定社交任务时使用COT的效率，从而在正确性与效率评估之间实现更好平衡。大量实验结果表明，我们的基准测试能有效评估LLMs的社交技能。

Conversational Process Model Redesign

Abstract

arXiv:2505.05453v1 Announce Type: new Abstract: With the recent success of large language models (LLMs), the idea of AI-augmented Business Process Management systems is becoming more feasible. One of their essential characteristics is the ability to be conversationally actionable, allowing humans to interact with the LLM effectively to perform crucial process life cycle tasks such as process model design and redesign. However, most current research focuses on single-prompt execution and evaluation of results, rather than on continuous interaction between the user and the LLM. In this work, we aim to explore the feasibility of using LLMs to empower domain experts in the creation and redesign of process models in an iterative and effective way. The proposed conversational process model redesign (CPD) approach receives as input a process model and a redesign request by the user in natural language. Instead of just letting the LLM make changes, the LLM is employed to (a) identify process change patterns from literature, (b) re-phrase the change request to be aligned with an expected wording for the identified pattern (i.e., the meaning), and then to (c) apply the meaning of the change to the process model. This multi-step approach allows for explainable and reproducible changes. In order to ensure the feasibility of the CPD approach, and to find out how well the patterns from literature can be handled by the LLM, we performed an extensive evaluation. The results show that some patterns are hard to understand by LLMs and by users. Within the scope of the study, we demonstrated that users need support to describe the changes clearly. Overall the evaluation shows that the LLMs can handle most changes well according to a set of completeness and correctness criteria.

摘要

随着大型语言模型(LLM)近年来的成功应用，人工智能增强型业务流程管理系统的构想正变得愈发可行。其核心特征之一是具备对话式可操作性，允许人类通过与LLM的有效交互来执行关键流程生命周期任务，如流程模型设计与再设计。然而当前大多数研究仅关注单次提示的执行与结果评估，而非用户与LLM间的持续交互。本研究旨在探索如何利用LLM以迭代有效的方式赋能领域专家进行流程模型的创建与再设计。我们提出的对话式流程模型再设计(CPD)方法接收两个输入：流程模型和用户以自然语言表述的再设计请求。该方法并非直接让LLM实施修改，而是分三步：(a)从文献中识别流程变更模式；(b)将变更请求重新表述为与识别模式预期表述方式(即语义)相一致的表达；(c)将变更语义应用于流程模型。这种多步骤方法确保了变更的可解释性与可复现性。为验证CPD方法的可行性并评估LLM对文献中模式的处理能力，我们开展了全面评估。结果表明部分模式对LLM和用户而言较难理解。研究范围内，我们证实用户需要辅助才能清晰描述变更需求。总体评估显示，根据完整性与正确性标准，LLM能较好地处理大多数变更需求。

Adaptive Token Boundaries: Integrating Human Chunking Mechanisms into Multimodal LLMs

Abstract

arXiv:2505.04637v1 Announce Type: cross Abstract: Recent advancements in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in processing diverse data types, yet significant disparities persist between human cognitive processes and computational approaches to multimodal information integration. This research presents a systematic investigation into the parallels between human cross-modal chunking mechanisms and token representation methodologies in MLLMs. Through empirical studies comparing human performance patterns with model behaviors across visual-linguistic tasks, we demonstrate that conventional static tokenization schemes fundamentally constrain current models' capacity to simulate the dynamic, context-sensitive nature of human information processing. We propose a novel framework for dynamic cross-modal tokenization that incorporates adaptive boundaries, hierarchical representations, and alignment mechanisms grounded in cognitive science principles. Quantitative evaluations demonstrate that our approach yields statistically significant improvements over state-of-the-art models on benchmark tasks (+7.8% on Visual Question Answering, +5.3% on Complex Scene Description) while exhibiting more human-aligned error patterns and attention distributions. These findings contribute to the theoretical understanding of the relationship between human cognition and artificial intelligence, while providing empirical evidence for developing more cognitively plausible AI systems.

摘要

尽管多模态大语言模型（MLLMs）的最新进展展现了处理多样化数据类型的卓越能力，但人类认知过程与计算式多模态信息整合方法之间仍存在显著差异。本研究系统性地探究了人类跨模态组块机制与MLLMs中令牌表征方法的相似性。通过对比人类在视觉-语言任务中的表现模式与模型行为的实证研究，我们发现传统静态令牌化方案从根本上限制了现有模型模拟人类动态、上下文敏感信息处理的能力。为此，我们提出了一种新颖的动态跨模态令牌化框架，该框架融合了自适应边界、分层表征以及基于认知科学原理的对齐机制。定量评估表明，我们的方法在基准任务上实现了相对于最先进模型的统计学显著提升（视觉问答任务提升7.8%，复杂场景描述任务提升5.3%），同时表现出更符合人类认知的错误模式和注意力分布。这些发现深化了人类认知与人工智能关系的理论理解，并为开发更具认知合理性的AI系统提供了实证依据。

Personalized Risks and Regulatory Strategies of Large Language Models in Digital Advertising

Abstract

arXiv:2505.04665v1 Announce Type: cross Abstract: Although large language models have demonstrated the potential for personalized advertising recommendations in experimental environments, how advertising recommendation systems can be combined in actual operations with measures such as user privacy protection and data security is still an area worthy of in-depth discussion. To this end, this paper studies the personalized risks and regulatory strategies of large language models in digital advertising. This study first outlines the principles of large language models (LLMs), especially the self-attention mechanism based on the Transformer architecture, and how it enables a model to understand and generate natural language text. Then, the BERT (Bidirectional Encoder Representations from Transformers) model and the attention mechanism are combined to construct an algorithmic model for personalized advertising recommendations and user factor risk protection. The specific steps include: data collection and preprocessing, feature selection and construction, using large language models such as BERT for advertising semantic embedding, and ad recommendations based on user profiles. Then, local model training and data encryption are used to ensure the security of user privacy and avoid the leakage of personal data. This paper designs an experiment for personalized advertising recommendation based on BERT and verifies it with real user data. The experimental results show that BERT-based advertising push can effectively improve the click-through rate and conversion rate of advertisements. At the same time, through local model training and privacy protection mechanisms, the risk of user privacy leakage can be reduced to a certain extent.

摘要

尽管大型语言模型在实验环境中已展现出个性化广告推荐的潜力，但在实际运营中，广告推荐系统如何与用户隐私保护、数据安全等措施相结合仍是值得深入探讨的领域。为此，本文研究了数字广告中大型语言模型的个性化风险与监管策略。本研究首先阐述了大型语言模型（LLM）的原理，特别是基于Transformer架构的自注意力机制，以及如何使模型理解并生成自然语言文本。随后，结合BERT（基于Transformer的双向编码器表示）模型与注意力机制，构建了面向个性化广告推荐与用户因素风险保护的算法模型。具体步骤包括：数据收集与预处理、特征选择与构建、利用BERT等大型语言模型进行广告语义嵌入、基于用户画像的广告推荐。进而通过本地模型训练与数据加密来保障用户隐私安全，避免个人数据泄露。本文设计了基于BERT大型语言模型的个性化广告推荐实验，并采用真实用户数据进行验证。实验结果表明：基于BERT的广告推送能有效提升广告点击率与转化率；同时通过本地模型训练与隐私保护机制，可在一定程度上降低用户隐私泄露风险。

Abstract

arXiv:2505.04634v1 Announce Type: cross Abstract: Graph-based encoding of crystal structures has recently been quite successful for high-throughput material property prediction. However, using a single-modality model prevents us from exploiting the advantages of an enhanced feature space obtained by combining different representations. Specifically, pre-trained large language models (LLMs) can encode a large amount of knowledge which is beneficial for training of models. Moreover, the graph encoder is able to learn the local features while the text encoder is able to learn global information such as space group and crystal symmetry. In this work, we propose Material Multi-Modal Fusion (MatMMFuse), a fusion-based model which uses a multi-head attention mechanism to combine structure-aware embeddings from the Crystal Graph Convolution Network (CGCNN) and text embeddings from the SciBERT model. We train our model in an end-to-end framework using data from the Materials Project Dataset. We show that our proposed model improves on the vanilla CGCNN and SciBERT models for all four key properties: formation energy, band gap, energy above hull and Fermi energy. Specifically, we observe an improvement of 40% compared to the vanilla CGCNN model and 68% compared to the SciBERT model for predicting the formation energy per atom. Importantly, we demonstrate the zero-shot performance of the trained model on small curated datasets of Perovskites, Chalcogenides and the Jarvis Dataset. The results show that the proposed model exhibits better zero-shot performance than the individual vanilla CGCNN and SciBERT models. This enables researchers to deploy the model for specialized industrial applications where collection of training data is prohibitively expensive.

摘要

近年来，基于晶体结构图编码的高通量材料性质预测方法取得了显著进展。然而，单模态模型无法通过结合不同表征方式来利用增强特征空间的优势。具体而言，预训练大语言模型（LLMs）能够编码大量知识，这对模型训练大有裨益。此外，图编码器擅长学习局部特征，而文本编码器则能捕获空间群和晶体对称性等全局信息。本研究提出材料多模态融合模型（MatMMFuse），该融合模型采用多头注意力机制，将来自晶体图卷积网络（CGCNN）的结构感知嵌入与SciBERT模型的文本嵌入相结合。我们使用材料项目数据集的数据，以端到端框架训练模型。实验结果表明，在形成能、带隙、能量高于壳层以及费米能这四项关键性质预测上，本模型相比原始CGCNN和SciBERT模型均有提升。特别是在预测单原子形成能时，较原始CGCNN模型提升40%，较SciBERT模型提升68%。值得注意的是，我们在钙钛矿、硫族化合物小型精选数据集及Jarvis数据集上验证了训练模型的零样本性能。结果显示，所提模型比单独的原始CGCNN和SciBERT模型具有更好的零样本性能。这使得研究人员可将该模型应用于专业工业领域——在这些场景中，训练数据的采集成本往往极其高昂。

When Bad Data Leads to Good Models

Abstract

arXiv:2505.04741v1 Announce Type: cross Abstract: In large language model (LLM) pretraining, data quality is believed to determine model quality. In this paper, we re-examine the notion of "quality" from the perspective of pre- and post-training co-design. Specifically, we explore the possibility that pre-training on more toxic data can lead to better control in post-training, ultimately decreasing a model's output toxicity. First, we use a toy experiment to study how data composition affects the geometry of features in the representation space. Next, through controlled experiments with Olmo-1B models trained on varying ratios of clean and toxic data, we find that the concept of toxicity enjoys a less entangled linear representation as the proportion of toxic data increases. Furthermore, we show that although toxic data increases the generational toxicity of the base model, it also makes the toxicity easier to remove. Evaluations on Toxigen and Real Toxicity Prompts demonstrate that models trained on toxic data achieve a better trade-off between reducing generational toxicity and preserving general capabilities when detoxifying techniques such as inference-time intervention (ITI) are applied. Our findings suggest that, with post-training taken into account, bad data may lead to good models.

摘要

在大语言模型（LLM）预训练中，数据质量通常被认为决定模型质量。本文从训练前后协同设计的视角重新审视"质量"这一概念。具体而言，我们探讨了预训练阶段使用更多毒性数据可能增强后训练阶段的控制能力，从而最终降低模型输出毒性的可能性。首先，我们通过玩具实验研究数据构成如何影响表征空间中特征的几何分布。接着，通过对Olmo-1B模型在不同比例清洁数据与毒性数据下的控制实验，发现随着毒性数据比例增加，毒性概念会获得更少纠缠的线性表征。进一步研究表明，虽然毒性数据会提升基础模型的生成毒性，但也使得毒性更易被消除。在Toxigen和Real Toxicity Prompts数据集上的评估表明，当应用推理时干预（ITI）等去毒技术时，使用毒性数据训练的模型能在降低生成毒性和保持通用能力之间取得更好平衡。我们的研究结果表明，当考虑后训练环节时，劣质数据可能反而有助于构建优质模型。
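The idea that a "less entangled linear representation" makes toxicity easier to remove can be illustrated with a small steering sketch. It is a simplified stand-in for inference-time intervention, not the paper's implementation: the direction is assumed to be the normalized difference of class means, the activations are made-up 2-D vectors, and `alpha` is a hypothetical steering strength.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toxicity_direction(toxic_acts, clean_acts):
    """Assumed estimator: normalized difference between the class means."""
    mu_t = mean_vector(toxic_acts)
    mu_c = mean_vector(clean_acts)
    return unit([t - c for t, c in zip(mu_t, mu_c)])

def projection(activation, direction):
    """Component of the activation along the (unit) direction."""
    return sum(a * d for a, d in zip(activation, direction))

def intervene(activation, direction, alpha):
    """Shift the activation away from the toxic direction by alpha."""
    return [a - alpha * d for a, d in zip(activation, direction)]

# Hypothetical activations: toxicity lives almost entirely on axis 0.
toxic = [[2.0, 0.1], [1.8, -0.1]]
clean = [[0.0, 0.1], [0.2, -0.1]]
direction = toxicity_direction(toxic, clean)

original = [1.9, 0.0]
steered = intervene(original, direction, alpha=1.5)
print(projection(original, direction), projection(steered, direction))
```

When the concept sits on a single clean direction, as in this toy case, the shift lowers the toxic component without touching the other coordinate, which is the intuition behind the trade-off the abstract reports.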

Advancing Conversational Diagnostic AI with Multimodal Reasoning

Abstract

arXiv:2505.04653v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated great potential for conducting diagnostic conversations but evaluation has been largely limited to language-only interactions, deviating from the real-world requirements of remote care delivery. Instant messaging platforms permit clinicians and patients to upload and discuss multimodal medical artifacts seamlessly in medical consultation, but the ability of LLMs to reason over such data while preserving other attributes of competent diagnostic conversation remains unknown. Here we advance the conversational diagnosis and management performance of the Articulate Medical Intelligence Explorer (AMIE) through a new capability to gather and interpret multimodal data, and reason about this precisely during consultations. Leveraging Gemini 2.0 Flash, our system implements a state-aware dialogue framework, where conversation flow is dynamically controlled by intermediate model outputs reflecting patient states and evolving diagnoses. Follow-up questions are strategically directed by uncertainty in such patient states, leading to a more structured multimodal history-taking process that emulates experienced clinicians. We compared AMIE to primary care physicians (PCPs) in a randomized, blinded, OSCE-style study of chat-based consultations with patient actors. We constructed 105 evaluation scenarios using artifacts like smartphone skin photos, ECGs, and PDFs of clinical documents across diverse conditions and demographics. Our rubric assessed multimodal capabilities and other clinically meaningful axes like history-taking, diagnostic accuracy, management reasoning, communication, and empathy. Specialist evaluation showed AMIE to be superior to PCPs on 7/9 multimodal and 29/32 non-multimodal axes (including diagnostic accuracy). The results show clear progress in multimodal conversational diagnostic AI, but real-world translation needs further research.

摘要

大型语言模型（LLMs）在开展诊断对话方面展现出巨大潜力，但现有评估主要局限于纯文本交互，这与远程医疗服务的实际需求存在偏差。即时通讯平台允许临床医生和患者在诊疗过程中无缝上传并讨论多模态医疗资料，然而LLMs在保持合格诊断对话其他属性的同时处理此类数据的能力尚未可知。本研究通过赋予清晰医疗智能探索系统（AMIE）收集解读多模态数据并在问诊中精准推理的新能力，提升了其对话式诊断与管理性能。基于Gemini 2.0 Flash构建的系统采用状态感知对话框架，通过反映患者状态和动态诊断的中间模型输出来控制对话流程。后续提问策略性地针对患者状态的不确定性展开，形成模拟资深临床医生的结构化多模态病史采集过程。在一项随机双盲OSCE风格研究中，我们将AMIE与初级保健医生（PCPs）通过聊天咨询患者演员进行对比。研究构建了105个评估场景，涵盖智能手机皮肤照片、心电图和临床文档PDF等多种形式的医疗资料，涉及不同病症和人群。评估标准包含多模态能力及其他临床重要维度，如病史采集、诊断准确性、管理推理、沟通能力和同理心。专家评估显示AMIE在多模态维度7/9项、非多模态维度29/32项（包括诊断准确性）上优于PCPs。结果表明多模态对话诊断AI取得明显进展，但实际应用转化仍需进一步研究。

A Proposal for Evaluating the Operational Risk for ChatBots based on Large Language Models

Abstract

arXiv:2505.04784v1 Announce Type: cross Abstract: The emergence of Generative AI (Gen AI) and Large Language Models (LLMs) has enabled more advanced chatbots capable of human-like interactions. However, these conversational agents introduce a broader set of operational risks that extend beyond traditional cybersecurity considerations. In this work, we propose a novel, instrumented risk-assessment metric that simultaneously evaluates potential threats to three key stakeholders: the service-providing organization, end users, and third parties. Our approach incorporates the technical complexity required to induce erroneous behaviors in the chatbot--ranging from non-induced failures to advanced prompt-injection attacks--as well as contextual factors such as the target industry, user age range, and vulnerability severity. To validate our metric, we leverage Garak, an open-source framework for LLM vulnerability testing. We further enhance Garak to capture a variety of threat vectors (e.g., misinformation, code hallucinations, social engineering, and malicious code generation). Our methodology is demonstrated in a scenario involving chatbots that employ retrieval-augmented generation (RAG), showing how the aggregated risk scores guide both short-term mitigation and longer-term improvements in model design and deployment. The results underscore the importance of multi-dimensional risk assessments in operationalizing secure, reliable AI-driven conversational systems.

摘要

生成式人工智能（Gen AI）和大型语言模型（LLM）的出现使得能够实现更先进、具备类人交互能力的聊天机器人。然而，这些对话代理引入了超越传统网络安全考量的更广泛运营风险。本研究提出了一种新颖的、工具化的风险评估指标，可同时评估对三个关键利益相关方的潜在威胁：服务提供组织、终端用户及第三方。我们的方法综合了诱发聊天机器人错误行为所需的技术复杂性（从非诱导性故障到高级提示注入攻击），以及目标行业、用户年龄范围和漏洞严重性等情境因素。为验证该指标，我们利用开源框架Garak进行LLM漏洞测试，并进一步扩展其功能以捕获多种威胁向量（如错误信息、代码幻觉、社会工程和恶意代码生成）。通过采用检索增强生成（RAG）技术的聊天机器人场景，我们展示了该方法如何通过聚合风险评分指导短期风险缓解及长期模型设计与部署改进。研究结果强调了多维风险评估对于实现安全可靠的人工智能驱动对话系统运营的重要性。

QBD-RankedDataGen: Generating Custom Ranked Datasets for Improving Query-By-Document Search Using LLM-Reranking with Reduced Human Effort

Abstract

arXiv:2505.04732v1 Announce Type: cross Abstract: The Query-By-Document (QBD) problem is an information retrieval problem where the query is a document, and the retrieved candidates are documents that match the query document, often in a domain- or query-specific manner. This can be crucial for tasks such as patent matching, legal or compliance case retrieval, and academic literature review. Existing retrieval methods, including keyword search and document embeddings, can be optimized with domain-specific datasets to improve QBD search performance. However, creating these domain-specific datasets is often costly and time-consuming. Our work introduces a process for generating custom QBD-search datasets, which we refer to as QBD-RankedDataGen, and compares a set of methods to use within it. We provide a comparative analysis of our proposed methods in terms of cost, speed, and the human interface with the domain experts. The methods we compare leverage Large Language Models (LLMs), which can incorporate domain expert input to produce document scores and rankings, as well as explanations for human review. The process and methods we present can significantly reduce human effort in dataset creation for custom domains while still obtaining sufficient expert knowledge for tuning retrieval models. We evaluate our methods on QBD datasets from the Text Retrieval Conference (TREC) and fine-tune the parameters of the BM25 model, which is used in many industrial-strength search engines like OpenSearch, using the generated data.

摘要

文档查询（Query-By-Document，QBD）是一种信息检索问题，其查询本身为文档形式，检索目标是获取与查询文档相匹配的候选文档，通常需结合特定领域或查询需求进行处理。该技术对专利匹配、法律合规案例检索及学术文献综述等任务至关重要。现有检索方法（包括关键词搜索和文档嵌入）可通过领域专用数据集优化以提升QBD搜索性能，但构建此类数据集往往成本高昂且耗时。本研究提出了一种生成定制化QBD搜索数据集的流程（称为QBD-RankedDatagen），并比较了适用于该问题的一系列方法。我们从成本、速度及领域专家的人机交互维度对所提方法进行了对比分析。这些方法利用大型语言模型（LLMs）整合领域专家输入，生成文档评分、排序结果及可人工审核的解释说明。我们提出的流程与方法能显著降低定制领域数据集构建中的人工投入，同时为检索模型调优获取足够的专家知识。基于文本检索会议（TREC）的QBD数据集，我们评估了所提方法，并利用生成数据对BM25模型（广泛应用于OpenSearch等工业级搜索引擎）的参数进行了微调。
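Tuning BM25 on generated ranked data can be sketched as a grid search over the two standard BM25 parameters, `k1` (term-frequency saturation) and `b` (length normalization). This is only an illustration under assumptions: the abstract does not say which parameters or metric the authors use, so the mean reciprocal rank objective, the grids, and the toy documents below are all hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # Non-negative IDF variant.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

def reciprocal_rank(scores, relevant_idx):
    """1 / rank of the relevant document under the given scores."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return 1.0 / (order.index(relevant_idx) + 1)

def tune(labelled, docs, k1_grid, b_grid):
    """Return (best MRR, k1, b) over the grid; labelled = [(query, relevant_idx)]."""
    best = None
    for k1 in k1_grid:
        for b in b_grid:
            mrr = sum(reciprocal_rank(bm25_scores(q, docs, k1, b), rel)
                      for q, rel in labelled) / len(labelled)
            if best is None or mrr > best[0]:
                best = (mrr, k1, b)
    return best

# Hypothetical corpus and LLM-generated relevance labels.
docs = [
    ["patent", "claim", "battery", "anode"],
    ["battery", "recycling", "policy"],
    ["court", "ruling", "patent", "dispute", "appeal", "appeal"],
]
labelled = [(["battery", "anode", "patent"], 0), (["patent", "appeal"], 2)]
best = tune(labelled, docs, k1_grid=(0.8, 1.2, 2.0), b_grid=(0.25, 0.5, 0.75, 1.0))
print(best)
```

In a QBD setting the query would itself be a full document, so in practice the query tokens would come from the whole query document or from terms extracted from it.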

REVEAL: Multi-turn Evaluation of Image-Input Harms for Vision LLM

Abstract

arXiv:2505.04673v1 Announce Type: cross Abstract: Vision Large Language Models (VLLMs) represent a significant advancement in artificial intelligence by integrating image-processing capabilities with textual understanding, thereby enhancing user interactions and expanding application domains. However, their increased complexity introduces novel safety and ethical challenges, particularly in multi-modal and multi-turn conversations. Traditional safety evaluation frameworks, designed for text-based, single-turn interactions, are inadequate for addressing these complexities. To bridge this gap, we introduce the REVEAL (Responsible Evaluation of Vision-Enabled AI LLMs) Framework, a scalable and automated pipeline for evaluating image-input harms in VLLMs. REVEAL includes automated image mining, synthetic adversarial data generation, multi-turn conversational expansion using crescendo attack strategies, and comprehensive harm assessment through evaluators like GPT-4o. We extensively evaluated five state-of-the-art VLLMs, GPT-4o, Llama-3.2, Qwen2-VL, Phi3.5V, and Pixtral, across three important harm categories: sexual harm, violence, and misinformation. Our findings reveal that multi-turn interactions result in significantly higher defect rates compared to single-turn evaluations, highlighting deeper vulnerabilities in VLLMs. Notably, GPT-4o demonstrated the most balanced performance as measured by our Safety-Usability Index (SUI) followed closely by Pixtral. Additionally, misinformation emerged as a critical area requiring enhanced contextual defenses. Llama-3.2 exhibited the highest MT defect rate (16.55%) while Qwen2-VL showed the highest MT refusal rate (19.1%).

摘要

视觉大语言模型（VLLMs）通过整合图像处理能力与文本理解技术，显著推动了人工智能的发展，从而提升了用户交互体验并拓展了应用领域。然而，其复杂性的增加也带来了新的安全与伦理挑战，尤其在多模态多轮对话场景中。传统基于文本单轮交互的安全评估框架难以应对这些复杂问题。为此，我们提出REVEAL（视觉赋能AI大模型责任评估）框架——一个可扩展的自动化流程，用于评估VLLMs中的图像输入危害。该框架包含自动图像挖掘、合成对抗数据生成、基于渐进式攻击策略的多轮对话扩展，以及通过GPT-4o等评估器进行的全面危害分析。

我们对五款前沿VLLMs（GPT-4o、Llama-3.2、Qwen2-VL、Phi3.5V和Pixtral）在三大关键危害类别（性危害、暴力及错误信息）进行了深入评估。研究发现：相较于单轮评估，多轮交互会导致缺陷率显著上升，暴露出VLLMs更深层的脆弱性。值得注意的是，根据我们设计的安全-可用性指数（SUI）衡量，GPT-4o展现出最均衡的性能表现，Pixtral紧随其后。此外，错误信息被证明是需要加强上下文防御的关键领域。其中Llama-3.2的多轮缺陷率最高（16.55%），而Qwen2-VL的多轮拒绝率最高（19.1%）。

Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers

Abstract

arXiv:2505.04842v1 Announce Type: cross Abstract: Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value function for verification. In this work, we propose RL$^V$ that augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL$^V$ boosts MATH accuracy by over 20% with parallel sampling and enables $8$-$32\times$ efficient test-time compute scaling compared to the base RL method. RL$^V$ also exhibits strong generalization capabilities for both easy-to-hard and out-of-domain tasks. Furthermore, RL$^V$ achieves $1.2$-$1.6\times$ higher performance when jointly scaling parallel and sequential test-time compute with a long reasoning R1 model.

摘要

当前用于微调大型语言模型（LLM）推理器的强化学习（RL）方法（如GRPO或留一法PPO）通常会放弃已学习的价值函数，转而采用经验估计的回报。这种做法阻碍了依赖价值函数进行验证的测试时计算扩展。本研究提出RL$^V$ 方法，通过联合训练LLM作为推理器和生成式验证器（利用RL生成的数据），在不显著增加开销的情况下增强任何“无价值函数”RL方法的验证能力。实验表明，RL$^V$ 在并行采样条件下将MATH准确率提升超过20%，与基础RL方法相比可实现 $8$-$32\times$ 的测试时计算效率扩展。RL$^V$ 在易到难任务及域外任务中均表现出强大的泛化能力。此外，当与长推理R1模型联合扩展并行和顺序测试时计算时，RL$^V$ 能实现 $1.2$-$1.6\times$ 的性能提升。
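Why a verifier helps parallel test-time scaling can be shown with three common ways of aggregating sampled answers. The sketch assumes each sample comes with a scalar verifier score (for instance, the probability a generative verifier assigns to "correct"); the answers and scores are invented, and the abstract does not state which aggregation rule RL$^V$ uses.

```python
from collections import defaultdict

def majority_vote(answers):
    """Verifier-free baseline: pick the most frequent final answer."""
    counts = defaultdict(int)
    for answer in answers:
        counts[answer] += 1
    return max(counts, key=counts.get)

def best_of_n(answers, verifier_scores):
    """Pick the single sample the verifier scores highest."""
    return max(zip(answers, verifier_scores), key=lambda pair: pair[1])[0]

def weighted_vote(answers, verifier_scores):
    """Sum verifier scores per distinct answer and pick the largest total."""
    totals = defaultdict(float)
    for answer, score in zip(answers, verifier_scores):
        totals[answer] += score
    return max(totals, key=totals.get)

# Hypothetical parallel samples for one problem: the wrong answer "15" is more
# frequent, but the verifier trusts the "12" samples more.
answers = ["12", "15", "12", "15", "15"]
scores = [0.9, 0.2, 0.8, 0.3, 0.1]

print(majority_vote(answers), best_of_n(answers, scores), weighted_vote(answers, scores))
```

Here plain voting returns "15" while both verifier-based rules return "12", which is the kind of gap a learned verifier is meant to close as the number of samples grows.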

Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards

Abstract

arXiv:2505.04847v1 Announce Type: cross Abstract: Hallucinations remain a persistent challenge for LLMs. RAG aims to reduce hallucinations by grounding responses in contexts. However, even when provided context, LLMs still frequently introduce unsupported information or contradictions. This paper presents our efforts to measure LLM hallucinations with a focus on summarization tasks, assessing how often various LLMs introduce hallucinations when summarizing documents. We discuss Vectara's existing LLM hallucination leaderboard, based on the Hughes Hallucination Evaluation Model (HHEM). While HHEM and Vectara's Hallucination Leaderboard have garnered great research interest, we examine challenges faced by HHEM and current hallucination detection methods by analyzing the effectiveness of these methods on existing hallucination datasets. To address these limitations, we propose FaithJudge, an LLM-as-a-judge approach guided by few-shot human hallucination annotations, which substantially improves automated LLM hallucination evaluation over current methods. We introduce an enhanced hallucination leaderboard centered on FaithJudge, alongside our current hallucination leaderboard, enabling more reliable benchmarking of LLMs for hallucinations in RAG.

摘要

幻觉问题仍是大型语言模型(LLM)面临的持续挑战。检索增强生成(RAG)技术试图通过将响应锚定于上下文来减少幻觉。然而即使提供上下文，LLMs仍频繁生成缺乏依据的信息或矛盾内容。本文重点研究摘要任务中的LLM幻觉测量，评估不同LLMs在文档摘要时产生幻觉的频率。我们基于Hughes幻觉评估模型(HHEM)讨论了Vectara现有的LLM幻觉排行榜。尽管HHEM和Vectara幻觉排行榜已引发广泛研究兴趣，我们仍通过分析现有幻觉数据集上这些方法的有效性，检验了HHEM及当前幻觉检测方法面临的挑战。针对这些局限，我们提出FaithJudge——一种基于少量人工幻觉标注指导的LLM-as-a-judge方法，相较现有方法显著提升了LLM幻觉自动评估效果。我们推出了以FaithJudge为核心的增强版幻觉排行榜，与现有排行榜并列呈现，从而为RAG场景下的LLM幻觉提供更可靠的基准评估体系。

HiPerRAG: High-Performance Retrieval Augmented Generation for Scientific Insights

Abstract

arXiv:2505.04846v1 Announce Type: cross Abstract: The volume of scientific literature is growing exponentially, leading to underutilized discoveries, duplicated efforts, and limited cross-disciplinary collaboration. Retrieval Augmented Generation (RAG) offers a way to assist scientists by improving the factuality of Large Language Models (LLMs) in processing this influx of information. However, scaling RAG to handle millions of articles introduces significant challenges, including the high computational costs associated with parsing documents and embedding scientific knowledge, as well as the algorithmic complexity of aligning these representations with the nuanced semantics of scientific content. To address these issues, we introduce HiPerRAG, a RAG workflow powered by high performance computing (HPC) to index and retrieve knowledge from more than 3.6 million scientific articles. At its core are Oreo, a high-throughput model for multimodal document parsing, and ColTrast, a query-aware encoder fine-tuning algorithm that enhances retrieval accuracy by using contrastive learning and late-interaction techniques. HiPerRAG delivers robust performance on existing scientific question answering benchmarks and two new benchmarks introduced in this work, achieving 90% accuracy on SciQ and 76% on PubMedQA, outperforming both domain-specific models like PubMedGPT and commercial LLMs such as GPT-4. Scaling to thousands of GPUs on the Polaris, Sunspot, and Frontier supercomputers, HiPerRAG delivers million-document-scale RAG workflows for unifying scientific knowledge and fostering interdisciplinary innovation.

摘要

科学文献数量正呈指数级增长，这导致大量研究成果未被充分利用、科研工作重复以及跨学科合作受限。检索增强生成（RAG）技术通过提升大语言模型（LLMs）处理海量信息时的事实准确性，为科研人员提供了有效辅助。然而，将RAG扩展至处理数百万篇文献时面临重大挑战：包括文档解析与科学知识嵌入的高计算成本，以及将这些表征与科学内容复杂语义对齐的算法复杂性。为此，我们提出HiPerRAG——一种基于高性能计算（HPC）的RAG工作流，能够对超过360万篇科学文献进行知识索引与检索。其核心是Oreo（一个高通量多模态文档解析模型）和ColTrast（一种查询感知的编码器微调算法，通过对比学习与延迟交互技术提升检索精度）。HiPerRAG在现有科学问答基准及本文提出的两个新基准上表现优异：在SciQ达到90%准确率，在PubMedQA达到76%准确率，优于PubMedGPT等领域专用模型及GPT-4等商用LLMs。通过在Polaris、Sunspot和Frontier超级计算机上部署数千块GPU，HiPerRAG实现了百万级文献规模的RAG工作流，为整合科学知识与促进跨学科创新提供了解决方案。
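The "late-interaction techniques" mentioned above usually mean that query and document are kept as sets of token vectors and compared token by token at scoring time. A minimal sketch of the common MaxSim rule follows. It assumes a ColBERT-style formulation, which the abstract does not confirm for ColTrast, and the 2-D token embeddings are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def maxsim_score(query_vecs, doc_vecs):
    """For each query token take its best-matching document token, then sum."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)

# Hypothetical token embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.1]]               # covers only the first

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because document token vectors can be computed and indexed offline, only the cheap max-and-sum step runs per query, which is one reason this family of methods is attractive at the scale of millions of articles.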

GroverGPT-2: Simulating Grover's Algorithm via Chain-of-Thought Reasoning and Quantum-Native Tokenization

Abstract

arXiv:2505.04880v1 Announce Type: cross Abstract: Quantum computing offers theoretical advantages over classical computing for specific tasks, yet the boundary of practical quantum advantage remains an open question. To investigate this boundary, it is crucial to understand whether, and how, classical machines can learn and simulate quantum algorithms. Recent progress in large language models (LLMs) has demonstrated strong reasoning abilities, prompting exploration into their potential for this challenge. In this work, we introduce GroverGPT-2, an LLM-based method for simulating Grover's algorithm using Chain-of-Thought (CoT) reasoning and quantum-native tokenization. Building on its predecessor, GroverGPT-2 performs simulation directly from quantum circuit representations while producing logically structured and interpretable outputs. Our results show that GroverGPT-2 can learn and internalize quantum circuit logic through efficient processing of quantum-native tokens, providing direct evidence that classical models like LLMs can capture the structure of quantum algorithms. Furthermore, GroverGPT-2 outputs interleave circuit data with natural language, embedding explicit reasoning into the simulation. This dual capability positions GroverGPT-2 as a prototype for advancing machine understanding of quantum algorithms and modeling quantum circuit logic. We also identify an empirical scaling law for GroverGPT-2 with increasing qubit numbers, suggesting a path toward scalable classical simulation. These findings open new directions for exploring the limits of classical simulatability, enhancing quantum education and research, and laying groundwork for future foundation models in quantum computing.

摘要

量子计算在特定任务上具有超越经典计算的理论优势，但实际量子优势的边界仍是一个悬而未决的问题。为探究这一边界，理解经典机器是否及如何能够学习并模拟量子算法至关重要。大型语言模型（LLMs）近期展现出的强大推理能力，促使我们探索其应对这一挑战的潜力。本研究提出GroverGPT-2，这是一种基于LLM的方法，通过思维链（CoT）推理和量子原生标记化来模拟Grover算法。相较于前代模型，GroverGPT-2能直接从量子电路表示进行模拟，同时生成具有逻辑结构且可解释的输出。结果表明，GroverGPT-2能通过高效处理量子原生标记来学习并内化量子电路逻辑，这为LLM等经典模型可捕捉量子算法结构提供了直接证据。此外，GroverGPT-2的输出将电路数据与自然语言交织，将显式推理嵌入模拟过程。这种双重能力使GroverGPT-2成为推进机器理解量子算法和建模量子电路逻辑的原型。我们还发现了GroverGPT-2随量子比特数增加的经验缩放规律，为可扩展经典模拟指明了路径。这些发现为探索经典可模拟性极限、加强量子教育与研究，以及构建未来量子计算基础模型开辟了新方向。
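For reference, the computation that GroverGPT-2 is trained to reproduce can be written as a direct state-vector simulation. This sketch is the textbook classical simulation of Grover's algorithm with a single marked item, not the LLM-based method; the function name and defaults are illustrative.

```python
import math

def grover_probabilities(n_qubits, marked, iterations=None):
    """Simulate Grover search for one marked basis state; return outcome probabilities."""
    n_states = 2 ** n_qubits
    if iterations is None:
        # Near-optimal iteration count for a single marked item.
        iterations = math.floor(math.pi / 4 * math.sqrt(n_states))
    # Start in the uniform superposition.
    amps = [1 / math.sqrt(n_states)] * n_states
    for _ in range(iterations):
        # Oracle: flip the sign of the marked amplitude.
        amps[marked] = -amps[marked]
        # Diffusion: reflect every amplitude about the mean.
        mean = sum(amps) / n_states
        amps = [2 * mean - a for a in amps]
    return [a * a for a in amps]

probs = grover_probabilities(n_qubits=3, marked=5)
print(round(probs[5], 3))
```

The state vector has $2^n$ entries, so this exact simulation grows exponentially with the number of qubits, which is the cost that motivates asking whether a learned model can approximate the output distribution instead.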

ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning

Abstract

arXiv:2505.04881v1 Announce Type: cross Abstract: Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs caused by redundant content, increasing computational overhead, and degrading user experience. Existing compression methods either operate post-hoc pruning, risking disruption to reasoning coherence, or rely on sampling-based selection, which fails to intervene effectively during generation. In this work, we introduce a confidence-guided perspective to explain the emergence of redundant reflection in LRMs, identifying two key patterns: Confidence Deficit, where the model reconsiders correct steps due to low internal confidence, and Termination Delay, where reasoning continues even after reaching a confident answer. Based on this analysis, we propose ConCISE (Confidence-guided Compression In Step-by-step Efficient Reasoning), a framework that simplifies reasoning chains by reinforcing the model's confidence during inference, thus preventing the generation of redundant reflection steps. It integrates Confidence Injection to stabilize intermediate steps and Early Stopping to terminate reasoning when confidence is sufficient. Extensive experiments demonstrate that fine-tuning LRMs on ConCISE-generated data yields significantly shorter outputs, reducing length by up to approximately 50% under SimPO, while maintaining high task accuracy. ConCISE consistently outperforms existing baselines across multiple reasoning benchmarks.

摘要

大规模推理模型（LRMs）通过思维链（CoT）提示在复杂推理任务中表现优异，但常因冗余内容导致输出冗长，从而增加计算开销并降低用户体验。现有压缩方法要么采用事后剪枝，可能破坏推理连贯性；要么依赖基于采样的选择，无法在生成过程中有效干预。本研究提出一种置信度引导的视角来解释LRMs中冗余反思的产生，识别出两种关键模式：置信赤字（模型因内部置信度低而重新考虑正确步骤）和终止延迟（模型在获得置信答案后仍持续推理）。基于此分析，我们提出ConCISE（逐步高效推理中的置信度引导压缩框架），通过增强模型推理过程中的置信度来简化推理链，从而避免生成冗余反思步骤。该框架整合置信注入（稳定中间步骤）和早期终止（当置信度充足时停止推理）。大量实验表明，基于ConCISE生成数据微调的LRMs能显著缩短输出长度（在SimPO下最多减少约50%），同时保持高任务准确率。ConCISE在多个推理基准测试中始终优于现有基线方法。
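The Termination Delay pattern and the Early Stopping remedy can be illustrated with a toy filter over a reasoning chain. The sketch assumes each step already carries a confidence value in [0, 1]; how ConCISE actually estimates confidence is not described in the abstract, so the scores, the threshold, and the function name are hypothetical, and Confidence Injection is not modelled at all.

```python
def truncate_reasoning(steps, threshold=0.9):
    """Keep steps until one reaches the confidence threshold, then stop."""
    kept = []
    for text, confidence in steps:
        kept.append(text)
        if confidence >= threshold:
            break  # confident answer reached: later reflection is redundant
    return kept

# Hypothetical chain: the answer is reached at step 2, followed by re-checking.
chain = [
    ("Set up the equation 3x + 2 = 14.", 0.40),
    ("Solve: x = 4.", 0.95),
    ("Wait, let me double-check the arithmetic.", 0.96),
    ("Yes, x = 4.", 0.97),
]
compressed = truncate_reasoning(chain)
print(len(chain), "->", len(compressed))
```

In this example half of the steps are dropped, which is in the same spirit as the length reduction the abstract reports, though the number here is a property of the toy chain only.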

SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models

Abstract

arXiv:2505.04911v1 Announce Type: cross Abstract: This study introduces SpatialPrompting, a novel framework that harnesses the emergent reasoning capabilities of off-the-shelf multimodal large language models to achieve zero-shot spatial reasoning in three-dimensional (3D) environments. Unlike existing methods that rely on expensive 3D-specific fine-tuning with specialized 3D inputs such as point clouds or voxel-based features, SpatialPrompting employs a keyframe-driven prompt generation strategy. This framework uses metrics such as vision-language similarity, Mahalanobis distance, field of view, and image sharpness to select a diverse and informative set of keyframes from image sequences and then integrates them with corresponding camera pose data to effectively abstract spatial relationships and infer complex 3D structures. The proposed framework not only establishes a new paradigm for flexible spatial reasoning that utilizes intuitive visual and positional cues but also achieves state-of-the-art zero-shot performance on benchmark datasets, such as ScanQA and SQA3D, across several metrics. The proposed method effectively eliminates the need for specialized 3D inputs and fine-tuning, offering a simpler and more scalable alternative to conventional approaches.

摘要

本研究提出了一种名为SpatialPrompting的创新框架，该框架利用现成多模态大语言模型涌现的推理能力，实现了三维（3D）环境中的零样本空间推理。与现有方法依赖昂贵的3D专用微调（需使用点云或体素特征等专业3D输入）不同，SpatialPrompting采用关键帧驱动的提示生成策略。该框架通过视觉语言相似度、马氏距离、视场角和图像清晰度等指标，从图像序列中选择多样化的信息关键帧，并将其与对应相机位姿数据整合，从而有效抽象空间关系并推断复杂3D结构。该框架不仅建立了利用直观视觉与位置线索进行灵活空间推理的新范式，还在ScanQA和SQA3D等基准数据集上实现了多项指标的零样本最先进性能。所提方法彻底消除了对专业3D输入和微调的依赖，为传统方法提供了更简单且可扩展的替代方案。

An Open-Source Dual-Loss Embedding Model for Semantic Retrieval in Higher Education

Abstract

arXiv:2505.04916v1 Announce Type: cross Abstract: Recent advances in AI have catalyzed the adoption of intelligent educational tools, yet many semantic retrieval systems remain ill-suited to the unique linguistic and structural characteristics of academic content. This study presents two open-source embedding models fine-tuned for educational question answering, particularly in the context of course syllabi. A synthetic dataset of 3,197 sentence pairs, spanning synonymous terminology, paraphrased questions, and implicit-explicit mappings, was constructed through a combination of manual curation and large language model (LLM)-assisted generation. Two training strategies were evaluated: (1) a baseline model fine-tuned using MultipleNegativesRankingLoss (MNRL), and (2) a dual-loss model that combines MNRL with CosineSimilarityLoss to improve both semantic ranking and similarity calibration. Evaluations were conducted on 28 university course syllabi using a fixed set of natural language questions categorized into course, faculty, and teaching assistant information. Results demonstrate that both fine-tuned models outperform strong open-source baselines, including all-MiniLM-L6-v2 and multi-qa-MiniLM-L6-cos-v1, and that the dual-loss model narrows the performance gap with high-performing proprietary embeddings such as OpenAI's text-embedding-3 series. This work contributes reusable, domain-aligned embedding models and provides a replicable framework for educational semantic retrieval, supporting downstream applications such as academic chatbots, retrieval-augmented generation (RAG) systems, and learning management system (LMS) integrations.

摘要

人工智能的最新进展推动了智能教育工具的普及，但现有语义检索系统仍难以适应学术内容独特的语言与结构特征。本研究提出两种专为教育问答任务优化的开源嵌入模型，特别针对课程大纲场景。通过人工筛选与大语言模型辅助生成相结合，构建了包含3,197个句对的合成数据集，涵盖同义术语、问题复述及显隐式映射三类语义关系。评估了两种训练策略：(1)采用多重负样本排序损失(MNRL)的基线模型，(2)融合MNRL与余弦相似度损失的双损失模型以同步优化语义排序与相似度校准。基于28份大学课程大纲及预设的自然语言问题集（课程信息、教师信息、助教信息三类）的测试表明：两种微调模型均优于all-MiniLM-L6-v2和multi-qa-MiniLM-L6-cos-v1等强开源基线，且双损失模型缩小了与OpenAI text-embedding-3系列等高性能商业嵌入的差距。本研究贡献了可复用的领域适配嵌入模型，并为教育语义检索提供了可复制的技术框架，可支持学术聊天机器人、检索增强生成(RAG)系统及学习管理系统(LMS)集成等下游应用。
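The dual-loss idea can be written out in plain Python to show what each term optimizes. The sketch mirrors the usual definitions: in-batch cross-entropy over scaled cosine similarities for the ranking term, and mean squared error against a target similarity for the calibration term. The `scale` of 20, the unweighted sum, and the toy vectors are assumptions; the abstract does not give the weighting the authors use.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mnrl(anchors, positives, scale=20.0):
    """Ranking term: each anchor must pick its own positive among the batch."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_z - logits[i]  # cross-entropy with the diagonal as label
    return total / len(anchors)

def cosine_similarity_loss(pairs, labels):
    """Calibration term: squared error between cosine similarity and target."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)

def dual_loss(anchors, positives, pairs, labels, weight=1.0):
    """Assumed combination: ranking loss plus weighted calibration loss."""
    return mnrl(anchors, positives) + weight * cosine_similarity_loss(pairs, labels)

# Hypothetical 2-D embeddings.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
labels = [1.0, 0.0]

print(mnrl(anchors, aligned), mnrl(anchors, shuffled))
print(dual_loss(anchors, aligned, pairs, labels))
```

The ranking term only cares about relative order within a batch, while the calibration term pins absolute similarity values, which is why combining them can improve both ranking and score calibration.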

Chain-of-Thought Tokens are Computer Program Variables

Abstract

arXiv:2505.04955v1 Announce Type: cross Abstract: Chain-of-thought (CoT) requires large language models (LLMs) to generate intermediate steps before reaching the final answer, and has been proven effective in helping LLMs solve complex reasoning tasks. However, the inner mechanism of CoT still remains largely unclear. In this paper, we empirically study the role of CoT tokens in LLMs on two compositional tasks: multi-digit multiplication and dynamic programming. While CoT is essential for solving these problems, we find that preserving only tokens that store intermediate results achieves comparable performance. Furthermore, we observe that storing intermediate results in an alternative latent form does not affect model performance. We also randomly intervene on some values in CoT, and notice that subsequent CoT tokens and the final answer change correspondingly. These findings suggest that CoT tokens may function like variables in computer programs but with potential drawbacks like unintended shortcuts and computational complexity limits between tokens. The code and data are available at https://github.com/solitaryzero/CoTs_are_Variables.

摘要

思维链（CoT）要求大语言模型（LLM）在得出最终答案前生成中间步骤，已被证明能有效帮助LLM解决复杂推理任务。然而，CoT的内在机制仍不甚明晰。本文通过实证研究，探讨了LLM中CoT标记在两个组合任务中的作用：多位数乘法与动态规划。虽然CoT对解决这些问题至关重要，但我们发现仅保留存储中间结果的标记即可实现相当的性能。此外，我们观察到以替代潜在形式存储中间结果不会影响模型表现。通过随机干预CoT中的部分数值，我们注意到后续CoT标记及最终答案会相应改变。这些发现表明，CoT标记可能类似于计算机程序中的变量，但也存在潜在缺陷，如无意形成的捷径以及标记间计算复杂度的限制。代码与数据详见https://github.com/solitaryzero/CoTs_are_Variables。
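The program-variable analogy can be made concrete with the multiplication task itself. In the sketch below, each partial product plays the role of a CoT token that stores an intermediate result, and overwriting one of them propagates to the final answer. This is an illustrative analogy only: the function and the `intervene` argument are hypothetical and do not reproduce the authors' experimental setup.

```python
def multiply_with_trace(a, b, intervene=None):
    """Long multiplication that records each partial product as a named variable."""
    trace = {}
    for position, digit_char in enumerate(reversed(str(b))):
        name = f"partial_{position}"
        value = a * int(digit_char) * 10 ** position
        if intervene and name in intervene:
            value = intervene[name]  # overwrite the stored intermediate result
        trace[name] = value
    # The final answer depends only on the stored intermediates.
    trace["answer"] = sum(v for k, v in trace.items() if k.startswith("partial_"))
    return trace

clean = multiply_with_trace(123, 45)
edited = multiply_with_trace(123, 45, intervene={"partial_0": 600})
print(clean)
print(edited)
```

The clean run stores 615 and 4920 and returns 5535; replacing the first stored value with 600 changes the answer to 5520, mirroring the observation that intervening on CoT values changes later tokens and the final answer.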

LVLM-MPC Collaboration for Autonomous Driving: A Safety-Aware and Task-Scalable Control Architecture

Abstract

arXiv:2505.04980v1 Announce Type: cross Abstract: This paper proposes a novel Large Vision-Language Model (LVLM) and Model Predictive Control (MPC) integration framework that delivers both task scalability and safety for Autonomous Driving (AD). LVLMs excel at high-level task planning across diverse driving scenarios. However, since these foundation models are not specifically designed for driving and their reasoning is not consistent with the feasibility of low-level motion planning, concerns remain regarding safety and smooth task switching. This paper integrates LVLMs with MPC Builder, which automatically generates MPCs on demand, based on symbolic task commands generated by the LVLM, while ensuring optimality and safety. The generated MPCs can strongly assist the execution or rejection of LVLM-driven task switching by providing feedback on the feasibility of the given tasks and generating task-switching-aware MPCs. Our approach provides a safe, flexible, and adaptable control framework, bridging the gap between cutting-edge foundation models and reliable vehicle operation. We demonstrate the effectiveness of our approach through a simulation experiment, showing that our system can safely and effectively handle highway driving while maintaining the flexibility and adaptability of LVLMs.

摘要

本文提出了一种新颖的大型视觉语言模型（LVLM）与模型预测控制（MPC）集成框架，旨在为自动驾驶（AD）同时实现任务可扩展性和安全性。LVLM擅长处理多样化驾驶场景中的高层任务规划，但由于这些基础模型并非专为驾驶设计，其推理过程与底层运动规划的可行性存在不一致性，因此在安全性和平滑任务切换方面仍存在隐患。本研究将LVLM与MPC构建器相结合，该系统能根据LVLM生成的符号化任务指令自动生成MPC控制器，同时确保最优性与安全性。生成的MPC控制器通过提供任务可行性反馈及生成支持任务切换的MPC方案，能够有效辅助执行或否决LVLM驱动的任务切换。该框架构建了一个安全、灵活且适应性强的控制体系，弥合了前沿基础模型与可靠车辆操作之间的鸿沟。通过仿真实验验证，我们的系统在保持LVLM灵活性与适应性的同时，能够安全高效地完成高速公路驾驶任务。

Rethinking Invariance in In-context Learning

Abstract

arXiv:2505.04994v1 Announce Type: cross Abstract: In-Context Learning (ICL) has emerged as a pivotal capability of auto-regressive large language models, yet it is hindered by a notable sensitivity to the ordering of context examples regardless of their mutual independence. To address this issue, recent studies have introduced several variant algorithms of ICL that achieve permutation invariance. However, many of these do not exhibit comparable performance with the standard auto-regressive ICL algorithm. In this work, we identify two crucial elements in the design of an invariant ICL algorithm: information non-leakage and context interdependence, which are not simultaneously achieved by any of the existing methods. These investigations lead us to the proposed Invariant ICL (InvICL), a methodology designed to achieve invariance in ICL while ensuring the two properties. Empirically, our findings reveal that InvICL surpasses previous models, both invariant and non-invariant, in most benchmark datasets, showcasing superior generalization capabilities across varying input lengths. Code is available at https://github.com/PKU-ML/InvICL.

摘要

上下文学习（ICL）已成为自回归大语言模型的核心能力，但其对上下文示例顺序的显著敏感性阻碍了发展，即使这些示例相互独立。为解决该问题，近期研究提出了多种实现排列不变性的ICL变体算法，然而其中多数未能达到标准自回归ICL算法的可比性能。本研究发现，构建不变性ICL算法需满足两个关键要素：信息无泄漏和上下文互依性，现有方法均未能同时实现这两点。基于此，我们提出不变性上下文学习（InvICL）方法，该设计在保证两个特性的同时实现ICL不变性。实验表明，InvICL在多数基准数据集上超越以往不变性与非不变性模型，并展现出对不同输入长度的卓越泛化能力。代码发布于https://github.com/PKU-ML/InvICL。

Understanding In-context Learning of Addition via Activation Subspaces

Abstract

arXiv:2505.05145v1 Announce Type: cross Abstract: To perform in-context learning, language models must extract signals from individual few-shot examples, aggregate these into a learned prediction rule, and then apply this rule to new examples. How is this implemented in the forward pass of modern transformer models? To study this, we consider a structured family of few-shot learning tasks for which the true prediction rule is to add an integer $k$ to the input. We find that Llama-3-8B attains high accuracy on this task for a range of $k$, and localize its few-shot ability to just three attention heads via a novel optimization approach. We further show the extracted signals lie in a six-dimensional subspace, where four of the dimensions track the unit digit and the other two dimensions track overall magnitude. We finally examine how these heads extract information from individual few-shot examples, identifying a self-correction mechanism in which mistakes from earlier examples are suppressed by later examples. Our results demonstrate how tracking low-dimensional subspaces across a forward pass can provide insight into fine-grained computational structures.

摘要

为实现上下文学习，语言模型必须从少量示例中提取信号，将其聚合为学习到的预测规则，并将该规则应用于新样本。现代Transformer模型的前向传播如何实现这一过程？为此，我们研究了一个结构化的小样本学习任务族，其真实预测规则是对输入加上整数 $k$ 。研究发现Llama-3-8B模型在 $k$ 取值范围内均能实现高准确率，并通过新型优化方法将其小样本能力定位至仅三个注意力头。进一步研究表明，提取的信号存在于六维子空间中，其中四维跟踪个位数，另两维跟踪整体量级。最后我们分析了这些注意力头如何从单个小样本中提取信息，发现了一种自我校正机制：后续样本会抑制前期样本的错误。这些结果证明，通过追踪前向传播过程中的低维子空间，可以揭示细粒度计算结构的运作机制。

Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models

Abstract

arXiv:2505.05189v1 Announce Type: cross Abstract: Prompt learning is one of the most effective paradigms for adapting pre-trained vision-language models (VLMs) to biomedical image classification tasks in few-shot scenarios. However, most current prompt learning methods use only text prompts and ignore the particular structures (such as complex anatomical structures and subtle pathological features) in biomedical images. In this work, we propose Biomed-DPT, a knowledge-enhanced dual modality prompt tuning technique. In designing the text prompt, Biomed-DPT constructs a dual prompt including the template-driven clinical prompts and the large language model (LLM)-driven domain-adapted prompts, then extracts the clinical knowledge from the domain-adapted prompts through the knowledge distillation technique. In designing the vision prompt, Biomed-DPT introduces the zero vector as a soft prompt to leverage attention re-weighting so that the focus on non-diagnostic regions and the recognition of non-critical pathological features are avoided. Biomed-DPT achieves an average classification accuracy of 66.14% across 11 biomedical image datasets covering 9 modalities and 10 organs, with performance reaching 78.06% in base classes and 75.97% in novel classes, surpassing the Context Optimization (CoOp) method by 6.20%, 3.78%, and 8.04%, respectively. Our code is available at https://github.com/Kanyooo/Biomed-DPT.

摘要

提示学习是在小样本场景下将预训练视觉-语言模型（VLM）适配到生物医学图像分类任务中最有效的范式之一。然而当前大多数提示学习方法仅使用文本提示，忽略了生物医学图像中特有的结构（如复杂的解剖结构和细微的病理特征）。本研究提出Biomed-DPT——一种知识增强的双模态提示调优技术。在文本提示设计方面，Biomed-DPT构建了包含模板驱动的临床提示和大型语言模型（LLM）驱动的领域适配提示的双重提示，并通过知识蒸馏技术从领域适配提示中提取临床知识。在视觉提示设计方面，Biomed-DPT引入零向量作为软提示以利用注意力重加权机制，从而避免对非诊断区域的关注和非关键病理特征的识别。Biomed-DPT在涵盖9种模态和10个器官的11个生物医学图像数据集上实现了66.14%的平均分类准确率，其中基类性能达78.06%，新类性能达75.97%，分别超过上下文优化（CoOp）方法6.20%、3.78%和8.04%。代码已开源于https://github.com/Kanyooo/Biomed-DPT。

Toward Reasonable Parrots: Why Large Language Models Should Argue with Us by Design

Abstract

arXiv:2505.05298v1 Announce Type: cross Abstract: In this position paper, we advocate for the development of conversational technology that is inherently designed to support and facilitate argumentative processes. We argue that, at present, large language models (LLMs) are inadequate for this purpose, and we propose an ideal technology design aimed at enhancing argumentative skills. This involves re-framing LLMs as tools to exercise our critical thinking rather than replacing them. We introduce the concept of 'reasonable parrots' that embody the fundamental principles of relevance, responsibility, and freedom, and that interact through argumentative dialogical moves. These principles and moves arise out of millennia of work in argumentation theory and should serve as the starting point for LLM-based technology that incorporates basic principles of argumentation.

摘要

在本立场文件中，我们主张开发一种本质设计用于支持和促进论证过程的对话技术。我们认为当前的大型语言模型（LLMs）尚不足以实现这一目标，并提出了一种旨在提升论证技能的理想技术设计方案。这涉及将LLMs重新定位为锻炼批判性思维的工具而非替代品。我们引入了"理性鹦鹉"的概念，这些模型体现相关性、责任性和自由性三大基本原则，并通过论证性对话行为进行交互。这些原则和行为源自数千年的论证理论研究成果，应当作为融入基础论证原则的LLM技术的设计起点。

Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks

Abstract

arXiv:2505.05190v1 Announce Type: cross Abstract: Text watermarking aims to subtly embed statistical signals into text by controlling the Large Language Model (LLM)'s sampling process, enabling watermark detectors to verify that the output was generated by the specified model. The robustness of these watermarking algorithms has become a key factor in evaluating their effectiveness. Current text watermarking algorithms embed watermarks in high-entropy tokens to ensure text quality. In this paper, we reveal that this seemingly benign design can be exploited by attackers, posing a significant risk to the robustness of the watermark. We introduce a generic, efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA), which exploits this vulnerability by calculating the self-information of each token to identify potential pattern tokens and perform a targeted attack. Our work exposes a widely prevalent vulnerability in current watermarking algorithms. The experimental results show that SIRA achieves nearly 100% attack success rates on seven recent watermarking methods at a cost of only 0.88 USD per million tokens. Our approach does not require any access to the watermark algorithms or the watermarked LLM and can seamlessly transfer to any LLM as the attack model, even mobile-level models. Our findings highlight the urgent need for more robust watermarking.

摘要

文本水印技术旨在通过控制大语言模型（LLM）的采样过程，将统计信号微妙地嵌入文本中，使水印检测器能够验证输出是否由指定模型生成。这些水印算法的鲁棒性已成为评估其有效性的关键因素。当前文本水印算法通常在高熵标记中嵌入水印以确保文本质量。本文揭示，这种看似无害的设计可能被攻击者利用，对水印的鲁棒性构成重大威胁。我们提出了一种通用高效的改写攻击方法——自信息重写攻击（SIRA），该方法通过计算每个标记的自信息来识别潜在的模式标记并实施针对性攻击，从而利用这一漏洞。我们的工作暴露了当前水印算法中普遍存在的脆弱性。实验结果表明，SIRA对七种最新水印方法的攻击成功率接近100%，且每百万标记成本仅为0.88美元。该方法无需访问水印算法或带水印的LLM，并可无缝迁移至任何LLM（包括移动端模型）作为攻击模型。本研究凸显了开发更强健水印技术的紧迫性。
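The quantity at the center of the SIRA abstract, self-information, is the negative log-probability of a token: I(x) = -log2 p(x). The sketch below only illustrates that definition and a simple thresholding step. The probability list is a toy stand-in for the conditional probabilities a real language model would supply, and the function names and threshold are hypothetical, not taken from the paper.

```python
import math

def self_information(prob: float) -> float:
    """Self-information in bits: I(x) = -log2 p(x)."""
    return -math.log2(prob)

def flag_high_information_tokens(probs, threshold_bits):
    """Return indices of tokens whose self-information exceeds the threshold.

    `probs` is a hypothetical list of per-token probabilities; in practice
    these would come from a language model, not a hand-written table.
    """
    return [i for i, p in enumerate(probs)
            if self_information(p) > threshold_bits]

tokens = ["the", "quantum", "cat", "sat"]
probs = [0.5, 0.01, 0.125, 0.25]  # toy values chosen for illustration
flagged = flag_high_information_tokens(probs, threshold_bits=2.5)
flagged_tokens = [tokens[i] for i in flagged]
```

Low-probability tokens carry the most self-information, so under these toy values "quantum" and "cat" are the ones flagged as candidates for rewriting.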

Software Development Life Cycle Perspective: A Survey of Benchmarks for CodeLLMs and Agents

Abstract

arXiv:2505.05283v1 Announce Type: cross Abstract: Code large language models (CodeLLMs) and agents have shown great promise in tackling complex software engineering tasks. Compared to traditional software engineering methods, CodeLLMs and agents offer stronger abilities, and can flexibly process inputs and outputs in both natural language and code. Benchmarking plays a crucial role in evaluating the capabilities of CodeLLMs and agents, guiding their development and deployment. However, despite their growing significance, there remains a lack of comprehensive reviews of benchmarks for CodeLLMs and agents. To bridge this gap, this paper provides a comprehensive review of existing benchmarks for CodeLLMs and agents, studying and analyzing 181 benchmarks from 461 relevant papers, covering the different phases of the software development life cycle (SDLC). Our findings reveal a notable imbalance in the coverage of current benchmarks, with approximately 60% focused on the software development phase in SDLC, while requirements engineering and software design phases receive minimal attention at only 5% and 3%, respectively. Additionally, Python emerges as the dominant programming language across the reviewed benchmarks. Finally, this paper highlights the challenges of current research and proposes future directions, aiming to narrow the gap between the theoretical capabilities of CodeLLMs and agents and their application in real-world scenarios.

摘要

代码大语言模型（CodeLLMs）与智能体在解决复杂软件工程任务方面展现出巨大潜力。相较于传统软件工程方法，CodeLLMs和智能体具备更强大的能力，能够灵活处理自然语言与代码的输入输出。基准测试对评估CodeLLMs与智能体的能力至关重要，可指导其开发与部署。然而尽管其重要性日益凸显，目前仍缺乏对CodeLLMs与智能体基准测试的系统性综述。为填补这一空白，本文对现有CodeLLMs与智能体基准测试进行了全面梳理，从461篇相关文献中研究分析了181个基准测试，覆盖软件开发生命周期（SDLC）各阶段。研究发现当前基准测试存在显著不均衡现象：约60%集中于SDLC的软件开发阶段，而需求工程与软件设计阶段仅分别占5%和3%。此外，Python在所有被分析的基准测试中占据主导地位。最后，本文指出现有研究的挑战并展望未来方向，旨在缩小CodeLLMs与智能体理论能力同实际应用场景之间的差距。

Reasoning Models Don't Always Say What They Think

Abstract

arXiv:2505.05410v1 Announce Type: cross Abstract: Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model's CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully representing models' actual reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out. They also suggest that in settings like ours where CoT reasoning is not necessary, test-time monitoring of CoTs is unlikely to reliably catch rare and catastrophic unexpected behaviors.

摘要

链式思考（CoT）为AI安全提供了潜在优势，因其允许通过监控模型的CoT来理解其意图和推理过程。然而，这种监控的有效性取决于CoT能否真实反映模型的实际推理过程。我们评估了最先进推理模型在提示中包含6种推理线索时的CoT忠实度，发现：（1）在大多数测试场景和模型中，CoT至少能在1%使用线索的案例中揭示其线索使用情况，但揭示率通常低于20%；（2）基于结果的强化学习初期能提升忠实度，但会进入平台期而无法达到饱和；（3）当强化学习增加线索使用频率（奖励破解）时，即使未针对CoT监控进行训练，其线索语言化倾向也不会提升。这些结果表明，CoT监控是发现训练和评估过程中不良行为的有前途的方法，但尚不足以完全规避此类行为。研究还表明，在我们这类无需CoT推理的场景中，测试时对CoT的监控难以可靠捕捉罕见且灾难性的意外行为。

Scalable Chain of Thoughts via Elastic Reasoning

Abstract

arXiv:2505.05315v1 Announce Type: cross Abstract: Large reasoning models (LRMs) have achieved remarkable progress on complex tasks by generating extended chains of thought (CoT). However, their uncontrolled output lengths pose significant challenges for real-world deployment, where inference-time budgets on tokens, latency, or compute are strictly constrained. We propose Elastic Reasoning, a novel framework for scalable chain of thoughts that explicitly separates reasoning into two phases--thinking and solution--with independently allocated budgets. At test time, Elastic Reasoning prioritizes the completeness of solution segments, significantly improving reliability under tight resource constraints. To train models that are robust to truncated thinking, we introduce a lightweight budget-constrained rollout strategy, integrated into GRPO, which teaches the model to reason adaptively when the thinking process is cut short and generalizes effectively to unseen budget constraints without additional training. Empirical results on mathematical (AIME, MATH500) and programming (LiveCodeBench, Codeforces) benchmarks demonstrate that Elastic Reasoning performs robustly under strict budget constraints, while incurring significantly lower training cost than baseline methods. Remarkably, our approach also produces more concise and efficient reasoning even in unconstrained settings. Elastic Reasoning offers a principled and practical solution to the pressing challenge of controllable reasoning at scale.

摘要

大型推理模型（LRMs）通过生成扩展的思维链（CoT）在复杂任务上取得了显著进展。然而，其不受控制的输出长度对实际部署提出了重大挑战，特别是在推理时存在严格的令牌、延迟或计算资源预算限制的场景。我们提出弹性推理框架，这是一种可扩展思维链的新方法，明确将推理过程分离为两个阶段——思考阶段和解答阶段，并分别分配独立预算。在测试时，弹性推理优先保证解答片段的完整性，从而在严格资源限制下显著提高可靠性。为训练具有截断思考鲁棒性的模型，我们提出一种轻量级的预算约束展开策略，该策略与GRPO框架集成，教导模型在思考过程中断时进行自适应推理，并能有效泛化至未见过的预算约束而无需额外训练。在数学（AIME、MATH500）和编程（LiveCodeBench、Codeforces）基准测试上的实证结果表明，弹性推理在严格预算约束下表现稳健，同时训练成本显著低于基线方法。值得注意的是，即使在无约束环境下，该方法也能产生更简洁高效的推理过程。弹性推理为大规模可控推理这一紧迫挑战提供了原则性且实用的解决方案。
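The two-phase budgeting described in the Elastic Reasoning abstract can be pictured with a small sketch. It assumes the thinking and solution phases are already available as token lists and that a `</think>` marker separates them; both are illustrative assumptions, as are the function and parameter names. A real system would enforce the budgets during generation rather than by slicing finished lists.

```python
def apply_elastic_budgets(thinking_tokens, solution_tokens,
                          think_budget, solution_budget,
                          end_of_thinking="</think>"):
    """Cut each phase to its own budget and join them with a marker.

    Because the solution has a separate allowance, a long thinking phase
    cannot use up the tokens reserved for the final answer.
    """
    thinking = thinking_tokens[:think_budget]    # thinking may be cut short
    solution = solution_tokens[:solution_budget]  # solution keeps its own budget
    return thinking + [end_of_thinking] + solution

output = apply_elastic_budgets(
    thinking_tokens=["step1", "step2", "step3", "step4"],
    solution_tokens=["answer", "=", "42"],
    think_budget=2,
    solution_budget=3,
)
```

Here the thinking phase is truncated after two tokens, while all three solution tokens survive because they draw on a separate allowance.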

StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant

Abstract

arXiv:2505.05467v1 Announce Type: cross Abstract: We present StreamBridge, a simple yet effective framework that seamlessly transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models into online scenarios: (1) limited capability for multi-turn real-time understanding, and (2) lack of proactive response mechanisms. Specifically, StreamBridge incorporates (1) a memory buffer combined with a round-decayed compression strategy, supporting long-context multi-turn interactions, and (2) a decoupled, lightweight activation model that can be effortlessly integrated into existing Video-LLMs, enabling continuous proactive responses. To further support StreamBridge, we construct Stream-IT, a large-scale dataset tailored for streaming video understanding, featuring interleaved video-text sequences and diverse instruction formats. Extensive experiments show that StreamBridge significantly improves the streaming understanding capabilities of offline Video-LLMs across various tasks, outperforming even proprietary models such as GPT-4o and Gemini 1.5 Pro. Simultaneously, it achieves competitive or superior performance on standard video understanding benchmarks.

摘要

我们提出StreamBridge，这是一个简单而高效的框架，能够将离线视频大语言模型无缝转换为支持流式处理的模型。该框架解决了现有模型适应在线场景时的两个核心挑战：(1) 多轮实时理解能力有限；(2) 缺乏主动响应机制。具体而言，StreamBridge包含：(1) 结合轮次衰减压缩策略的记忆缓冲区，支持长上下文多轮交互；(2) 可轻松集成到现有视频大语言模型中的解耦轻量级激活模型，实现持续主动响应。为支持StreamBridge，我们构建了Stream-IT数据集——一个专为流式视频理解定制的大规模数据集，包含交错排列的视频-文本序列和多样化的指令格式。大量实验表明，StreamBridge显著提升了离线视频大语言模型在各种任务中的流式理解能力，甚至超越了GPT-4o和Gemini 1.5 Pro等专有模型。同时，该框架在标准视频理解基准测试中达到了具有竞争力或更优的性能表现。

Crosslingual Reasoning through Test-Time Scaling

Abstract

arXiv:2505.05408v1 Announce Type: cross Abstract: Reasoning capabilities of large language models are primarily studied for English, even when pretrained models are multilingual. In this work, we investigate to what extent English reasoning finetuning with long chain-of-thoughts (CoTs) can generalize across languages. First, we find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages including low-resource languages, to an extent where they outperform models twice their size. Second, we reveal that while English-centric RLMs' CoTs are naturally predominantly English, they consistently follow a quote-and-think pattern to reason about quoted non-English inputs. Third, we discover an effective strategy to control the language of long CoT reasoning, and we observe that models reason better and more efficiently in high-resource languages. Finally, we observe poor out-of-domain reasoning generalization, in particular from STEM to cultural commonsense knowledge, even for English. Overall, we demonstrate the potential, study the mechanisms and outline the limitations of crosslingual generalization of English reasoning test-time scaling. We conclude that practitioners should let English-centric RLMs reason in high-resource languages, while further work is needed to improve reasoning in low-resource languages and out-of-domain contexts.

摘要

大型语言模型的推理能力研究主要集中于英语，即便预训练模型本身是多语言的。本研究探讨了基于长思维链（CoTs）的英语推理微调在跨语言泛化中的效果。首先，我们发现针对英语优化的推理语言模型（RLMs）通过增加推理计算规模，能显著提升包括低资源语言在内的多语言数学推理能力，其表现甚至可超越两倍规模的模型。其次，尽管英语中心化RLMs生成的思维链主要使用英语，但它们会持续采用"引用-思考"模式来处理非英语输入内容。第三，我们提出了一种有效控制长思维链推理语言的策略，并观察到模型在高资源语言中表现出更高效、更优质的推理能力。最后，研究发现模型在跨领域推理泛化（特别是从STEM领域到文化常识领域）表现欠佳，英语场景亦不例外。总体而言，我们揭示了英语推理测试时扩展的跨语言泛化潜力，分析了其作用机制，并界定了局限性。结论表明：实践者应让英语中心化RLMs使用高资源语言进行推理，同时仍需进一步研究以提升低资源语言和跨领域场景的推理能力。

ComPO: Preference Alignment via Comparison Oracles

Abstract

arXiv:2505.05465v1 Announce Type: cross Abstract: Direct alignment methods are increasingly used for aligning large language models (LLMs) with human preferences. However, these methods suffer from the issues of verbosity and likelihood displacement, which can be driven by noisy preference pairs that induce similar likelihood for preferred and dispreferred responses. The contributions of this paper are two-fold. First, we propose a new preference alignment method based on comparison oracles and provide a convergence guarantee for its basic scheme. Second, we improve our method using some heuristics and conduct experiments to demonstrate the flexibility and compatibility of the practical scheme in improving the performance of LLMs using noisy preference pairs. Evaluations are conducted across multiple base and instruction-tuned models (Mistral-7B, Llama-3-8B and Gemma-2-9B) with benchmarks (AlpacaEval 2, MT-Bench and Arena-Hard). Experimental results show the effectiveness of our method as an alternative for addressing the limitations of existing direct alignment methods. A highlight of our work is that we evidence the importance of designing specialized methods for preference pairs with distinct likelihood margin, which complements the recent findings of Razin et al. (2025).

摘要

直接对齐方法正日益广泛地用于将大型语言模型（LLMs）与人类偏好对齐。然而，这些方法存在冗长性和似然偏移问题，这些问题可能由噪声偏好对引起，这些偏好对会使优选和非优选响应产生相似的似然值。本文的贡献有两点：首先，我们提出了一种基于比较预言机的新偏好对齐方法，并为其基本方案提供了收敛性保证；其次，我们通过启发式方法改进该方案，并通过实验证明实用方案在利用噪声偏好对提升LLMs性能时具有灵活性和兼容性。我们在多个基础模型和指令调优模型（Mistral-7B、Llama-3-8B和Gemma-2-9B）上使用基准测试（AlpacaEval 2、MT-Bench和Arena-Hard）进行评估。实验结果表明，我们的方法能有效替代现有直接对齐方法以解决其局限性。本研究的一个重要发现是：我们证实了针对具有显著似然差异的偏好对设计专用方法的必要性，这补充了Razin等人（2025）的最新研究成果。

TransProQA: an LLM-based literary Translation evaluation metric with Professional Question Answering

Abstract

arXiv:2505.05423v1 Announce Type: cross Abstract: The impact of Large Language Models (LLMs) has extended into literary domains. However, existing evaluation metrics prioritize mechanical accuracy over artistic expression and tend to overrate machine translation (MT) as being superior to experienced professional human translation. In the long run, this bias could result in a permanent decline in translation quality and cultural authenticity. In response to the urgent need for a specialized literary evaluation metric, we introduce TransProQA, a novel, reference-free, LLM-based question-answering (QA) framework designed specifically for literary translation evaluation. TransProQA uniquely integrates insights from professional literary translators and researchers, focusing on critical elements in literary quality assessment such as literary devices, cultural understanding, and authorial voice. Our extensive evaluation shows that while literary-finetuned XCOMET-XL yields marginal gains, TransProQA substantially outperforms current metrics, achieving up to 0.07 gain in correlation (ACC-EQ and Kendall's tau) and surpassing the best state-of-the-art (SOTA) metrics by over 15 points in adequacy assessments. Incorporating professional translator insights as weights further improves performance, highlighting the value of translator inputs. Notably, TransProQA approaches human-level evaluation performance comparable to trained linguistic annotators. It demonstrates broad applicability to open-source models such as LLaMA3.3-70b and Qwen2.5-32b, indicating its potential as an accessible and training-free literary evaluation metric and a valuable tool for evaluating texts that require local processing due to copyright or ethical considerations.

摘要

大语言模型（LLMs）的影响已延伸至文学领域。然而，现有评估指标更注重机械准确性而非艺术表达，且往往高估机器翻译（MT），认为其优于经验丰富的专业人工翻译。长期来看，这种偏见可能导致翻译质量和文化真实性的永久性下降。针对当前对专业化文学评估指标的迫切需求，我们提出TransProQA——一种新型的、无参考的、基于LLM的问答（QA）框架，专为文学翻译评估设计。TransProQA创新性地整合了专业文学译者和研究者的洞见，重点关注文学质量评估中的关键要素，如文学手法、文化理解和作者风格。我们的广泛评估表明，虽然经过文学微调的XCOMET-XL仅带来边际提升，但TransProQA显著优于现有指标，在相关性（ACC-EQ和Kendall's tau）上最高提升0.07，并在充分性评估中超过最佳前沿（SOTA）指标15分以上。将专业译者的见解作为权重进一步提升了性能，凸显了译者输入的价值。值得注意的是，TransProQA达到了与受过训练的语言标注者相当的人类级评估性能。该框架对LLaMA3.3-70b和Qwen2.5-32b等开源模型展现出广泛适用性，表明其有望成为一种易用且无需训练的文学评估指标，同时也是评估因版权或伦理问题需本地处理文本的有力工具。

Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute

Abstract

arXiv:2504.00762v4 Announce Type: replace Abstract: This paper presents a simple, effective, and cost-efficient strategy to improve LLM performance by scaling test-time compute. Our strategy builds upon the repeated-sampling-then-voting framework, with a novel twist: incorporating multiple models, even weaker ones, to leverage their complementary strengths that potentially arise from diverse training data and paradigms. By using consistency as a signal, our strategy dynamically switches between models. Theoretical analysis highlights the efficiency and performance advantages of our strategy. Extensive experiments on six datasets demonstrate that our strategy not only outperforms self-consistency and state-of-the-art multi-agent debate approaches, but also significantly reduces inference costs. Additionally, our strategy, ModelSwitch, requires only a few comparable LLMs to achieve optimal performance and can be extended with verification methods, demonstrating the potential of leveraging multiple LLMs in the generation-verification paradigm.

摘要

本文提出了一种简单、有效且经济高效的策略，通过扩展测试时计算来提升大语言模型性能。该策略基于重复采样后投票的框架，并引入创新思路：整合多个模型（包括较弱模型），以利用其可能源自不同训练数据和范式的互补优势。通过将一致性作为信号，我们的策略实现了模型间的动态切换。理论分析揭示了该策略在效率和性能上的优势。在六个数据集上的大量实验表明，本策略不仅优于自一致性和最先进的多智能体辩论方法，还能显著降低推理成本。此外，ModelSwitch仅需少量可比大语言模型即可实现最佳性能，并可通过验证方法进行扩展，这展现了在生成-验证范式中利用多个大语言模型的潜力。
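The idea of repeated sampling, voting, and consistency-triggered switching can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' algorithm: consistency is measured here as the share of samples agreeing with the most common answer, the final answer is a plain majority vote over everything sampled so far, and `models` is a hypothetical list of callables that map a question to an answer string.

```python
from collections import Counter

def switch_and_vote(models, question, samples_per_model, consistency_threshold):
    """Sample each model in turn and move to the next only if answers disagree."""
    all_answers = []
    for model in models:
        answers = [model(question) for _ in range(samples_per_model)]
        all_answers.extend(answers)
        # Assumed consistency signal: fraction of samples matching the top answer.
        top_count = Counter(answers).most_common(1)[0][1]
        if top_count / samples_per_model >= consistency_threshold:
            break  # answers agree enough, so no further model is queried
    # Plain majority vote over every answer collected so far.
    return Counter(all_answers).most_common(1)[0][0]
```

When the first model already answers consistently, later models are never called, which is where the saving in inference cost would come from in a scheme like this.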

A Survey of Slow Thinking-based Reasoning LLMs using Reinforced Learning and Inference-time Scaling Law

Abstract

arXiv:2505.02665v2 Announce Type: replace Abstract: This survey explores recent advancements in reasoning large language models (LLMs) designed to mimic "slow thinking" - a reasoning process inspired by human cognition, as described in Kahneman's Thinking, Fast and Slow. These models, like OpenAI's o1, focus on scaling computational resources dynamically during complex tasks, such as math reasoning, visual reasoning, medical diagnosis, and multi-agent debates. We present the development of reasoning LLMs and list their key technologies. By synthesizing over 100 studies, this survey charts a path toward LLMs that combine human-like deep thinking with scalable efficiency for reasoning. The review breaks down methods into three categories: (1) test-time scaling, which dynamically adjusts computation based on task complexity via search, sampling, and dynamic verification; (2) reinforced learning, which refines decision-making through iterative improvement leveraging policy networks, reward models, and self-evolution strategies; and (3) slow-thinking frameworks (e.g., long CoT, hierarchical processes), which structure problem-solving into manageable steps. The survey highlights the challenges and future directions of this domain. Understanding and advancing the reasoning abilities of LLMs is crucial for unlocking their full potential in real-world applications, from scientific discovery to decision support systems.

Generating Symbolic World Models via Test-time Scaling of Large Language Models

Abstract

arXiv:2502.04728v2 Announce Type: replace Abstract: Solving complex planning problems requires Large Language Models (LLMs) to explicitly model the state transition to avoid rule violations, comply with constraints, and ensure optimality, a task hindered by the inherent ambiguity of natural language. To overcome such ambiguity, Planning Domain Definition Language (PDDL) is leveraged as a planning abstraction that enables precise and formal state descriptions. With PDDL, we can generate a symbolic world model where classic searching algorithms, such as A*, can be seamlessly applied to find optimal plans. However, directly generating PDDL domains with current LLMs remains an open challenge due to the lack of PDDL training data. To address this challenge, we propose to scale up the test-time computation of LLMs to enhance their PDDL reasoning capabilities, thereby enabling the generation of high-quality PDDL domains. Specifically, we introduce a simple yet effective algorithm, which first employs a Best-of-N sampling approach to improve the quality of the initial solution and then refines the solution in a fine-grained manner with verbalized machine learning. Our method outperforms o1-mini by a considerable margin in the generation of PDDL domains, achieving over 50% success rate on two tasks (i.e., generating PDDL domains from natural language description or PDDL problems). This is done without requiring additional training. By taking advantage of PDDL as state abstraction, our method is able to outperform current state-of-the-art methods on almost all competition-level planning tasks.

摘要

解决复杂规划问题需要大型语言模型（LLM）显式建模状态转移以避免规则违反、满足约束条件并确保最优性——这一任务因自然语言固有的模糊性而受阻。为克服此类模糊性，本研究利用规划领域定义语言（PDDL）作为规划抽象工具，实现精确的形式化状态描述。借助PDDL，我们可以构建符号化世界模型，从而无缝应用A*等经典搜索算法来寻找最优规划方案。然而，由于缺乏PDDL训练数据，当前LLM直接生成PDDL域仍存在挑战。为此，我们提出通过扩展LLM的测试时计算来增强其PDDL推理能力，从而实现高质量PDDL域的生成。具体而言，我们设计了一种简单有效的算法：首先采用最佳N采样策略提升初始解质量，继而通过语言化机器学习进行细粒度优化。该方法在PDDL域生成任务上显著优于o1-mini模型，在两项任务（即从自然语言描述或PDDL问题生成PDDL域）中成功率超过50%，且无需额外训练。通过利用PDDL作为状态抽象工具，我们的方法在几乎所有竞赛级规划任务上均优于当前最先进方法。
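Best-of-N sampling, the first stage named in the abstract above, draws N candidates and keeps the one a scoring function rates highest. The sketch below shows only that generic pattern. `generate` and `score` are hypothetical callables standing in for an LLM sampler and a quality estimate, and the later refinement stage with verbalized machine learning is not modeled.

```python
def best_of_n(generate, score, n):
    """Draw n candidates and return the highest-scoring one.

    `generate(i)` produces the i-th candidate and `score(candidate)`
    returns a number where larger means better; both are placeholders.
    """
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=score)

# Toy usage: candidates are strings and the score is simply their length.
best = best_of_n(generate=lambda i: "a" * (i + 1), score=len, n=3)
```

In a real pipeline the score might come from a validator or a model-based judge; the selection logic stays the same.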

Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation

Abstract

arXiv:2504.15699v2 Announce Type: replace Abstract: Embodied agents exhibit immense potential across a multitude of domains, making the assurance of their behavioral safety a fundamental prerequisite for their widespread deployment. However, existing research predominantly concentrates on the security of general large language models, lacking specialized methodologies for establishing safety benchmarks and input moderation tailored to embodied agents. To bridge this gap, this paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents. This framework encompasses the entire pipeline, including taxonomy definition, dataset curation, moderator architecture, model training, and rigorous evaluation. Notably, we introduce EAsafetyBench, a meticulously crafted safety benchmark engineered to facilitate both the training and stringent assessment of moderators specifically designed for embodied agents. Furthermore, we propose Pinpoint, an innovative prompt-decoupled input moderation scheme that harnesses a masked attention mechanism to effectively isolate and mitigate the influence of functional prompts on moderation tasks. Extensive experiments conducted on diverse benchmark datasets and models validate the feasibility and efficacy of the proposed approach. The results demonstrate that our methodologies achieve an impressive average detection accuracy of 94.58%, surpassing the performance of existing state-of-the-art techniques, alongside an exceptional moderation processing time of merely 0.002 seconds per instance.

摘要

具身智能体在众多领域展现出巨大潜力，确保其行为安全性成为广泛部署的基本前提。然而现有研究主要集中于通用大语言模型的安全性，缺乏针对具身智能体建立安全基准和输入审核的专门方法。为填补这一空白，本文提出一个新颖的输入审核框架，专为保障具身智能体安全而精心设计。该框架涵盖完整流程，包括分类体系定义、数据集构建、审核器架构、模型训练和严格评估。值得注意的是，我们提出了EAsafetyBench——一个精心设计的安全基准，专门用于具身智能体审核器的训练和严格评估。此外，我们提出Pinpoint方案，这种创新的提示解耦输入审核机制利用掩码注意力机制，有效隔离并削弱功能提示对审核任务的影响。在多样化基准数据集和模型上进行的大量实验验证了所提方法的可行性和有效性。结果表明，我们的方法实现了94.58%的平均检测准确率，超越现有最先进技术，同时保持每实例仅0.002秒的卓越审核处理速度。

MultiMind: Enhancing Werewolf Agents with Multimodal Reasoning and Theory of Mind

Abstract

arXiv:2504.18039v2 Announce Type: replace Abstract: Large Language Model (LLM) agents have demonstrated impressive capabilities in social deduction games (SDGs) like Werewolf, where strategic reasoning and social deception are essential. However, current approaches remain limited to textual information, ignoring crucial multimodal cues such as facial expressions and tone of voice that humans naturally use to communicate. Moreover, existing SDG agents primarily focus on inferring other players' identities without modeling how others perceive themselves or fellow players. To address these limitations, we use One Night Ultimate Werewolf (ONUW) as a testbed and present MultiMind, the first framework integrating multimodal information into SDG agents. MultiMind processes facial expressions and vocal tones alongside verbal content, while employing a Theory of Mind (ToM) model to represent each player's suspicion levels toward others. By combining this ToM model with Monte Carlo Tree Search (MCTS), our agent identifies communication strategies that minimize suspicion directed at itself. Through comprehensive evaluation in both agent-versus-agent simulations and studies with human players, we demonstrate MultiMind's superior performance in gameplay. Our work presents a significant advancement toward LLM agents capable of human-like social reasoning across multimodal domains.

摘要

大型语言模型（LLM）智能体在需要战略推理与社交欺骗的狼人等社交推理游戏（SDG）中展现出卓越能力，但现有方法仅局限于文本信息，忽略了人类自然交流中的面部表情、语音语调等多模态关键线索。此外，当前SDG智能体主要聚焦于推断其他玩家身份，而未能建模玩家彼此间的认知状态。为突破这些限制，我们以《一夜终极狼人》（ONUW）为实验平台，提出首个整合多模态信息的SDG框架MultiMind。该框架通过同步处理语言内容、面部表情与语音特征，并采用心理理论（ToM）模型量化玩家间的相互怀疑程度，结合蒙特卡洛树搜索（MCTS）来制定最小化自身嫌疑的沟通策略。在智能体对抗模拟与人类玩家实验的综合评估中，MultiMind均展现出显著优势。本研究为实现跨多模态人类级社交推理的LLM智能体提供了重要突破。

Recursive Inference Scaling: A Winning Path to Scalable Inference in Language and Multimodal Systems

Abstract

arXiv:2502.07503v4 Announce Type: replace Abstract: Inspired by recent findings on the fractal geometry of language, we introduce Recursive INference Scaling (RINS) as a complementary, plug-in recipe for scaling inference time in language and multimodal systems. RINS is a particular form of recursive depth that significantly outperforms 55+ other variants, including the recent "repeat-all-over" (RAO) strategy in Mobile LLM (Liu et al., 2024) and latent recurrent thinking (Geiping et al., 2025). Unlike prior works, we carry out our comparisons on a compute-matched regime, and demonstrate that for a fixed model size and training compute budget, RINS substantially improves language modeling performance. It also generalizes beyond pure language tasks, delivering gains in multimodal systems, including a +2% improvement in 0-shot ImageNet accuracy for SigLIP-B/16. Additionally, by deriving data scaling laws, we show that RINS improves both the asymptotic performance limits and the scaling exponents. More importantly, with lightweight (linear) adapters (comprising <1% of model parameters) and stochastic dropout, RINS offers a no-regret strategy, meaning that RINS-enabled pretraining improves performance in language modeling even when recursive depth is not applied at inference time. This corresponds to improving performance on a training compute-, parameter-, and inference-matched regime, suggesting its potential as a viable component of LLM pretraining!

摘要

受近期关于语言分形几何结构研究的启发，我们提出递归推理扩展（RINS）作为语言和多模态系统中扩展推理时间的补充性插件方案。RINS是一种特殊形式的递归深度策略，其性能显著优于其他55种变体方案，包括Mobile LLM（Liu等人，2024）提出的"全重复"（RAO）策略和潜在循环思维（Geiping等人，2025）。与先前研究不同，我们在计算匹配机制下进行对比实验，证明在固定模型规模和训练计算预算条件下，RINS能显著提升语言建模性能。该方法还适用于纯语言任务之外的领域，在多模态系统中同样带来性能提升，例如使SigLIP-B/16模型在ImageNet零样本识别准确率提升2%。通过推导数据缩放定律，我们发现RINS同时改进了渐近性能极限和缩放指数。更重要的是，通过采用轻量级（线性）适配器（占模型参数<1%）和随机丢弃技术，RINS实现了无悔策略——即使推理时不应用递归深度，启用RINS的预训练仍能提升语言建模性能。这意味着在训练计算量、参数量和推理成本相匹配的条件下实现性能提升，表明其有望成为大语言模型预训练的有效组件。

Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon

Abstract

arXiv:2406.17746v2 Announce Type: replace-cross Abstract: Memorization in language models is typically treated as a homogenous phenomenon, neglecting the specifics of the memorized data. We instead model memorization as the effect of a set of complex factors that describe each sample and relate it to the model and corpus. To build intuition around these factors, we break memorization down into a taxonomy: recitation of highly duplicated sequences, reconstruction of inherently predictable sequences, and recollection of sequences that are neither. We demonstrate the usefulness of our taxonomy by using it to construct a predictive model for memorization. By analyzing dependencies and inspecting the weights of the predictive model, we find that different factors influence the likelihood of memorization differently depending on the taxonomic category.

摘要

语言模型中的记忆通常被视为同质化现象，忽略了被记忆数据的具体特性。我们提出将记忆建模为描述每个样本并将其与模型及语料库相关联的一组复杂因素的综合效应。为理解这些因素，我们将记忆分解为分类体系：高度重复序列的复述、固有可预测序列的重构以及两者皆非序列的回想。通过构建记忆预测模型，我们验证了该分类体系的有效性。通过分析依赖关系并检验预测模型的权重，我们发现不同因素对记忆可能性的影响会因分类类别而异。

HORAE: A Domain-Agnostic Language for Automated Service Regulation

Abstract

arXiv:2406.06600v4 Announce Type: replace-cross Abstract: Artificial intelligence is rapidly encroaching on the field of service regulation. However, existing AI-based regulation techniques are often tailored to specific application domains and thus are difficult to generalize in an automated manner. This paper presents Horae, a unified specification language for modeling (multimodal) regulation rules across a diverse set of domains. We showcase how Horae facilitates an intelligent service regulation pipeline by further exploiting a fine-tuned large language model named RuleGPT that automates the Horae modeling process, thereby yielding an end-to-end framework for fully automated intelligent service regulation. The feasibility and effectiveness of our framework are demonstrated over a benchmark of various real-world regulation domains. In particular, we show that our open-sourced, fine-tuned RuleGPT with 7B parameters suffices to outperform GPT-3.5 and perform on par with GPT-4o.

摘要

人工智能正迅速渗透到服务监管领域。然而现有基于AI的监管技术通常针对特定应用领域定制，难以实现自动化泛化。本文提出Horae——一种跨领域（多模态）监管规则建模的统一规范语言。我们通过进一步开发名为RuleGPT的微调大语言模型来自动化Horae建模过程，从而构建全自动智能服务监管的端到端框架，展示了Horae如何赋能智能服务监管流程。基于多领域真实监管基准的实验验证了该框架的可行性与有效性。特别地，我们证明开源的70亿参数微调版RuleGPT已足以超越GPT-3.5，并与GPT-4o达到同等性能水平。

Enhancing Differential Testing With LLMs For Testing Deep Learning Libraries

Abstract

arXiv:2406.07944v2 Announce Type: replace-cross Abstract: Differential testing offers a promising strategy to alleviate the test oracle problem by comparing the test results between alternative implementations. However, existing differential testing techniques for deep learning (DL) libraries are limited by the key challenges of finding alternative implementations (called counterparts) for a given API and subsequently generating diverse test inputs. To address the two challenges, this paper introduces DLLens, an LLM-enhanced differential testing technique for DL libraries. To address the first challenge, DLLens incorporates an LLM-based counterpart synthesis workflow, with the insight that the counterpart of a given DL library API's computation could be successfully synthesized through certain composition and adaptation of the APIs from another DL library. To address the second challenge, DLLens incorporates a static analysis technique that extracts the path constraints from the implementations of a given API and its counterpart to guide diverse test input generation. The extraction is facilitated by LLM's knowledge of the concerned DL library and its upstream libraries. We evaluate DLLens on two popular DL libraries, TensorFlow and PyTorch. Our evaluation shows that DLLens synthesizes counterparts for 1.84 times as many APIs as those found by state-of-the-art techniques on these libraries. Moreover, under the same time budget, DLLens covers 7.23% more branches and detects 1.88 times as many bugs as state-of-the-art techniques on 200 randomly sampled APIs. DLLens has successfully detected 71 bugs in recent TensorFlow and PyTorch libraries. Among them, 59 are confirmed by developers, including 46 confirmed as previously unknown bugs, and 10 of these previously unknown bugs have been fixed in the latest version of TensorFlow and PyTorch.

摘要

差分测试通过比较不同实现版本间的测试结果，为解决测试预言问题提供了有效策略。然而现有深度学习库差分测试技术面临两大关键挑战：如何为目标API寻找替代实现（称为对应体）以及如何生成多样化测试输入。针对这两个挑战，本文提出DLLens——一种基于大语言模型增强的深度学习库差分测试技术。针对第一个挑战，DLLens设计了基于大语言模型的对应体合成工作流，其核心思想是通过跨库API的组合与适配，可成功合成目标深度学习库API的计算对应体。针对第二个挑战，DLLens采用静态分析技术从API及其对应体实现中提取路径约束，以指导多样化测试输入生成，该过程借助大语言模型对相关深度学习库及其上游库的领域知识实现加速。我们在TensorFlow和PyTorch两大主流深度学习库上评估DLLens。实验表明：DLLens合成的API对应体数量达到现有最优技术的1.84倍；在相同时间预算下，对200个随机抽样API的测试中，DLLens的分支覆盖率提升7.23%，检测到的错误数量是现有技术的1.88倍。DLLens已成功检测出TensorFlow和PyTorch最新版本中的71个错误，其中59个获开发者确认（包含46个此前未知的错误），目前已有10个新发现错误在最新版本中得到修复。
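The core idea of differential testing described above can be illustrated with a minimal sketch. It is not DLLens itself: there is no LLM-based counterpart synthesis and no path-constraint extraction here. Two hand-written softmax implementations stand in for an API and its counterpart, and random inputs stand in for guided test generation. All function names are hypothetical.

```python
import math
import random

def softmax_reference(xs):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_counterpart(xs):
    """Alternative implementation of the same computation via log-sum-exp."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [math.exp(x - lse) for x in xs]

def softmax_buggy(xs):
    """Deliberately wrong counterpart: forgets to normalize."""
    m = max(xs)
    return [math.exp(x - m) for x in xs]

def differential_test(impl_a, impl_b, trials=200, tol=1e-9, seed=0):
    """Run both implementations on random inputs and collect the inputs
    on which their outputs disagree by more than tol."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(trials):
        xs = [rng.uniform(-50.0, 50.0) for _ in range(rng.randint(1, 8))]
        out_a, out_b = impl_a(xs), impl_b(xs)
        if any(abs(a - b) > tol for a, b in zip(out_a, out_b)):
            mismatches.append(xs)
    return mismatches

agreeing = differential_test(softmax_reference, softmax_counterpart)
disagreeing = differential_test(softmax_reference, softmax_buggy)
```

No expected output has to be written by hand: a disagreement between the two implementations is itself the signal, which is how differential testing sidesteps the test oracle problem.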

Exploring the Trade-Offs: Quantization Methods, Task Difficulty, and Model Size in Large Language Models From Edge to Giant

Abstract

arXiv:2409.11055v3 Announce Type: replace-cross Abstract: Quantization has gained attention as a promising solution for the cost-effective deployment of large and small language models. However, most prior work has been limited to perplexity or basic knowledge tasks and lacks a comprehensive evaluation of recent models like Llama-3.3. In this paper, we conduct a comprehensive evaluation of instruction-tuned models spanning 1B to 405B parameters, applying four quantization methods across 13 datasets. Our findings reveal that (1) quantized models generally surpass smaller FP16 baselines, yet they often struggle with instruction-following and hallucination detection; (2) FP8 consistently emerges as the most robust option across tasks, and AWQ tends to outperform GPTQ in weight-only quantization; (3) smaller models can suffer severe accuracy drops at 4-bit quantization, while 70B-scale models maintain stable performance; (4) notably, hard tasks do not always experience the largest accuracy losses, indicating that quantization magnifies a model's inherent weaknesses rather than simply correlating with task difficulty; and (5) an LLM-based judge (MT-Bench) highlights significant performance declines in coding and STEM tasks, though reasoning may sometimes improve.

摘要

量化技术作为一种经济高效部署大、小型语言模型的解决方案已受到广泛关注。然而，先前研究大多局限于困惑度或基础知识任务评估，且缺乏对Llama-3.3等最新模型的全面测评。本文对1B至405B参数规模的指令微调模型进行了系统评估，在13个数据集上应用了四种量化方法。研究发现：（1）量化模型总体优于较小规模的FP16基线模型，但在指令遵循和幻觉检测方面表现欠佳；（2）FP8在所有任务中表现最为稳健，而AWQ在仅权重量化中通常优于GPTQ；（3）4比特量化会使较小模型出现显著精度下降，而70B级模型能保持稳定性能；（4）值得注意的是，高难度任务并非总是遭遇最大精度损失，表明量化会放大模型固有缺陷而非简单与任务难度相关；（5）基于LLM的评估器（MT-Bench）显示，编码与STEM任务性能显著下降，但推理能力偶有提升。
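To give a feel for why bit width matters in weight-only quantization, here is a minimal sketch of symmetric round-to-nearest quantization with a single per-tensor scale. This is a deliberately simple baseline scheme, not AWQ, GPTQ, or FP8 as evaluated in the paper, and the weight values are made up for illustration.

```python
def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization with one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp to the representable integer range after rounding.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

def mean_abs_error(weights, bits):
    """Average reconstruction error introduced by quantizing to `bits` bits."""
    codes, scale = quantize_symmetric(weights, bits)
    restored = dequantize(codes, scale)
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)

# Hypothetical weights, chosen only to make the example concrete.
weights = [0.42, -1.30, 0.07, 0.88, -0.55, 1.27, -0.02, 0.61]
err4 = mean_abs_error(weights, 4)
err8 = mean_abs_error(weights, 8)
```

With only 15 signed levels at 4 bits, small weights such as 0.07 and -0.02 collapse to zero, so `err4` is noticeably larger than `err8`. How that error translates into task accuracy is what the evaluation above measures; this sketch says nothing about it.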

E2E-AFG: An End-to-End Model with Adaptive Filtering for Retrieval-Augmented Generation

Abstract

arXiv:2411.00437v2 Announce Type: replace-cross Abstract: Retrieval-augmented generation methods often neglect the quality of content retrieved from external knowledge bases, resulting in irrelevant information or potential misinformation that negatively affects the generation results of large language models. In this paper, we propose an end-to-end model with adaptive filtering for retrieval-augmented generation (E2E-AFG), which integrates answer existence judgment and text generation into a single end-to-end framework. This enables the model to focus more effectively on relevant content while reducing the influence of irrelevant information and generating accurate answers. We evaluate E2E-AFG on six representative knowledge-intensive language datasets, and the results show that it consistently outperforms baseline models across all tasks, demonstrating the effectiveness and robustness of the proposed approach.

摘要

检索增强生成方法往往忽视从外部知识库获取内容的质量，导致无关信息或潜在错误信息影响大语言模型的生成结果。本文提出一种用于检索增强生成的自适应过滤端到端模型（E2E-AFG），该模型将答案存在性判断与文本生成整合至单一端到端框架中，使模型能更有效地聚焦相关内容，同时减少无关信息干扰并生成准确答案。我们在六个具有代表性的知识密集型语言数据集上评估E2E-AFG，结果表明该方法在所有任务中均持续优于基线模型，验证了所提方案的有效性与鲁棒性。

XG-NID: Dual-Modality Network Intrusion Detection using a Heterogeneous Graph Neural Network and Large Language Model

Abstract

arXiv:2408.16021v2 Announce Type: replace-cross Abstract: In the rapidly evolving field of cybersecurity, the integration of flow-level and packet-level information for real-time intrusion detection remains a largely untapped area of research. This paper introduces "XG-NID," a novel framework that, to the best of our knowledge, is the first to fuse flow-level and packet-level data within a heterogeneous graph structure, offering a comprehensive analysis of network traffic. Leveraging a heterogeneous graph neural network (GNN) with graph-level classification, XG-NID uniquely enables real-time inference while effectively capturing the intricate relationships between flow and packet payload data. Unlike traditional GNN-based methodologies that predominantly analyze historical data, XG-NID is designed to accommodate the heterogeneous nature of network traffic, providing a robust and real-time defense mechanism. Our framework extends beyond mere classification; it integrates Large Language Models (LLMs) to generate detailed, human-readable explanations and suggest potential remedial actions, ensuring that the insights produced are both actionable and comprehensible. Additionally, we introduce a new set of flow features based on temporal information, further enhancing the contextual and explainable inferences provided by our model. To facilitate practical application and accessibility, we developed "GNN4ID," an open-source tool that enables the extraction and transformation of raw network traffic into the proposed heterogeneous graph structure, seamlessly integrating flow and packet-level data. Our comprehensive quantitative comparative analysis demonstrates that XG-NID achieves an F1 score of 97% in multi-class classification, outperforming existing baseline and state-of-the-art methods. This sets a new standard in Network Intrusion Detection Systems by combining innovative data fusion with enhanced interpretability and real-time capabilities.

摘要

在快速发展的网络安全领域，融合流级与包级信息进行实时入侵检测仍是一个尚未充分开发的研究方向。本文提出的'XG-NID'框架（据我们所知）首次将流级与包级数据整合到异构图结构中，实现了网络流量的综合分析。该框架采用支持图级分类的异构图神经网络（GNN），既能有效捕捉流数据与数据包载荷之间的复杂关联，又能实现实时推理。与主要分析历史数据的传统GNN方法不同，XG-NID专门针对网络流量的异构特性设计，提供了强大的实时防御机制。我们的框架不仅实现分类功能，还集成大语言模型（LLMs）来生成详细的人类可读解释与补救措施建议，确保输出结果兼具可操作性与可理解性。此外，我们引入了一套基于时序信息的新型流特征，进一步增强了模型的上下文感知与可解释推理能力。为促进实际应用，我们开发了开源工具'GNN4ID'，可将原始网络流量提取并转换为所提出的异构图结构，实现流级与包级数据的无缝集成。定量对比分析表明，XG-NID在多分类任务中取得了97%的F1值，超越了现有基线方法和最先进技术。通过创新性的数据融合与增强的可解释性及实时能力相结合，本研究为网络入侵检测系统确立了新标准。

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

Abstract

arXiv:2409.12183v3 Announce Type: replace-cross Abstract: Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra "thinking" really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model's response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.

摘要

通过提示实现的思维链（CoT）是目前从大型语言模型（LLM）中激发推理能力的实际标准方法。但这种额外的“思考”究竟对哪些任务真正有效？为分析这一问题，我们对100多篇使用CoT的论文进行了定量元分析，并在14个模型上对20个数据集进行了自主评估。结果表明，CoT主要在涉及数学或逻辑的任务上带来显著的性能提升，而在其他类型任务上增益较小。在MMLU数据集上，除非问题或模型响应包含等号（表明存在符号运算与推理），直接生成答案的准确率与使用CoT几乎相同。基于这一发现，我们通过分离规划与执行阶段、并与工具增强型LLM对比，分析了CoT在这些问题上的行为。CoT的大部分增益来源于改进符号执行，但其表现仍逊色于使用符号求解器。我们的研究结果表明，CoT可以选择性应用，在保持性能的同时节省推理成本。此外，这些发现提示我们需要超越基于提示的CoT，转向能更有效利用中间计算的新范式，以覆盖LLM应用的整个范围。

Quantifying Risk Propensities of Large Language Models: Ethical Focus and Bias Detection through Role-Play

Abstract

arXiv:2411.08884v2 Announce Type: replace-cross Abstract: As Large Language Models (LLMs) become more prevalent, concerns about their safety, ethics, and potential biases have risen. Systematically evaluating LLMs' risk decision-making tendencies and attitudes, particularly in the ethical domain, has become crucial. This study innovatively applies the Domain-Specific Risk-Taking (DOSPERT) scale from cognitive science to LLMs and proposes a novel Ethical Decision-Making Risk Attitude Scale (EDRAS) to assess LLMs' ethical risk attitudes in depth. We further propose a novel approach integrating risk scales and role-playing to quantitatively evaluate systematic biases in LLMs. Through systematic evaluation and analysis of multiple mainstream LLMs, we assessed the "risk personalities" of LLMs across multiple domains, with a particular focus on the ethical domain, and revealed and quantified LLMs' systematic biases towards different groups. This research helps understand LLMs' risk decision-making and ensure their safe and reliable application. Our approach provides a tool for identifying and mitigating biases, contributing to fairer and more trustworthy AI systems. The code and data are available.

摘要

随着大型语言模型（LLMs）的日益普及，其安全性、伦理性和潜在偏见问题引发广泛关注。系统评估LLMs的风险决策倾向与态度（尤其在伦理领域）变得至关重要。本研究创新性地将认知科学中的领域特异性风险承担（DOSPERT）量表应用于LLMs，并提出新型伦理决策风险态度量表（EDRAS）以深入评估LLMs的伦理风险态度。我们进一步提出整合风险量表与角色扮演的新方法，用于定量评估LLMs的系统性偏见。通过对多个主流LLMs的系统性评估与分析，我们测量了LLMs跨多领域的'风险人格'（特别聚焦伦理领域），揭示并量化了LLMs对不同群体的系统性偏见。该研究有助于理解LLMs的风险决策机制，确保其安全可靠的应用。我们的方法为识别和消除偏见提供了工具，有助于构建更公平、可信的AI系统。代码与数据已公开。

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

Abstract

arXiv:2410.15236v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation, enabling applications across fields beyond healthcare, software engineering, and conversational systems. Despite these advancements in the past few years, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks. This review analyzes the state of research on these vulnerabilities and presents available defense strategies. We roughly categorize attack approaches into prompt-based, model-based, multimodal, and multilingual, covering techniques such as adversarial prompting, backdoor injections, and cross-modality exploits. We also review various defense mechanisms, including prompt filtering, transformation, alignment techniques, multi-agent defenses, and self-regulation, evaluating their strengths and shortcomings. We also discuss key metrics and benchmarks used to assess LLM safety and robustness, noting challenges like the quantification of attack success in interactive contexts and biases in existing datasets. Identifying current research gaps, we suggest future directions for resilient alignment strategies, advanced defenses against evolving attacks, automation of jailbreak detection, and consideration of ethical and societal impacts. This review emphasizes the need for continued research and cooperation within the AI community to enhance LLM security and ensure their safe deployment.

摘要

大型语言模型（LLMs）通过提升自然语言理解与生成能力，推动了人工智能的发展，其应用已扩展至医疗健康、软件工程和对话系统等多个领域。尽管过去几年取得显著进展，LLMs仍暴露出严重的安全漏洞，尤其容易受到提示注入和越狱攻击。本文综述了针对这些漏洞的研究现状及现有防御策略。我们将攻击方法大致划分为基于提示的、基于模型的、多模态和多语言四类，涵盖对抗性提示、后门注入及跨模态利用等技术。同时系统梳理了包括提示过滤、转换、对齐技术、多智能体防御和自我调节在内的防御机制，评估其优势与不足。此外，我们讨论了评估LLM安全性与鲁棒性的关键指标和基准测试，指出交互场景中攻击成功量化的挑战以及现有数据集的偏差问题。通过识别当前研究空白，本文提出未来研究方向：发展弹性对齐策略、针对进化攻击的先进防御技术、自动化越狱检测，以及伦理与社会影响的考量。本综述强调人工智能领域需持续开展研究合作，以增强LLM安全性并确保其安全部署。

Atyaephyra at SemEval-2025 Task 4: Low-Rank Negative Preference Optimization

Abstract

arXiv:2503.13690v2 Announce Type: replace-cross Abstract: We present a submission to the SemEval 2025 shared task on unlearning sensitive content from LLMs. Our approach employs negative preference optimization using low-rank adaptation. We show that we can utilize this combination to efficiently compute additional regularization terms, which help with unlearning stabilization. The results of our approach significantly exceed the shared task baselines.

摘要

我们提交了关于从大语言模型中消除敏感内容的SemEval 2025共享任务方案。该方法采用基于低秩适应的负偏好优化技术，通过这种组合策略能够高效计算额外的正则化项，从而有效稳定消除过程。实验结果表明，我们的方法性能显著超越了共享任务的基线水平。

ValuesRAG: Enhancing Cultural Alignment Through Retrieval-Augmented Contextual Learning

Abstract

arXiv:2501.01031v3 Announce Type: replace-cross Abstract: Ensuring cultural values alignment in Large Language Models (LLMs) remains a critical challenge, as these models often embed Western-centric biases from their training data, leading to misrepresentations and fairness concerns in cross-cultural applications. Existing approaches such as role assignment and few-shot learning struggle to address these limitations effectively due to their reliance on pre-trained knowledge, limited scalability, and inability to capture nuanced cultural values. To address these issues, we propose ValuesRAG, a novel and effective framework that applies Retrieval-Augmented Generation (RAG) with In-Context Learning (ICL) to integrate cultural and demographic knowledge dynamically during text generation. Leveraging the World Values Survey (WVS) dataset, ValuesRAG first generates summaries of values for each individual. We subsequently curate several representative regional datasets to serve as test datasets and retrieve relevant summaries of values based on demographic features, followed by a reranking step to select the top-k relevant summaries. We evaluate ValuesRAG using 6 diverse regional datasets and show that it consistently outperforms baselines, including zero-shot, role-assignment, few-shot, and hybrid methods, in both main experiments and ablation settings. Notably, ValuesRAG achieves the best overall performance over prior methods, demonstrating its effectiveness in fostering culturally aligned and inclusive AI systems. Our findings underscore the potential of dynamic retrieval-based methods to bridge the gap between global LLM capabilities and localized cultural values.

摘要

确保大型语言模型（LLMs）的文化价值观对齐仍是一个关键挑战，由于这些模型常从训练数据中嵌入西方中心偏见，导致跨文化应用中出现误表征与公平性问题。现有方法如角色分配和小样本学习因依赖预训练知识、可扩展性有限及无法捕捉细微文化价值观，难以有效解决这些局限。为此，我们提出ValuesRAG框架，该创新方案通过检索增强生成（RAG）结合上下文学习（ICL），在文本生成过程中动态整合文化与人口统计知识。基于世界价值观调查（WVS）数据集，ValuesRAG首先生成个体价值观摘要，随后筛选多个代表性地区数据集作为测试集，根据人口特征检索相关价值观摘要，并通过重排序步骤选取前k个相关摘要。我们在6个多样化地区数据集上评估ValuesRAG，结果表明其在主实验与消融设置中均稳定优于基线方法（包括零样本、角色分配、小样本及混合方法）。值得注意的是，ValuesRAG实现了相较于现有方法的最佳综合性能，证明了其在促进文化对齐与包容性AI系统方面的有效性。本研究凸显了基于动态检索的方法在弥合全球LLM能力与本土文化价值观之间鸿沟的潜力。

Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges

Abstract

arXiv:2503.08292v2 Announce Type: replace-cross Abstract: Large language models (LLMs) are increasingly applied to outpatient referral tasks across healthcare systems. However, there is a lack of standardized evaluation criteria to assess their effectiveness, particularly in dynamic, interactive scenarios. In this study, we systematically examine the capabilities and limitations of LLMs in managing tasks within Intelligent Outpatient Referral (IOR) systems and propose a comprehensive evaluation framework specifically designed for such systems. This framework comprises two core tasks: static evaluation, which focuses on evaluating the ability of predefined outpatient referrals, and dynamic evaluation, which evaluates capabilities of refining outpatient referral recommendations through iterative dialogues. Our findings suggest that LLMs offer limited advantages over BERT-like models, but show promise in asking effective questions during interactive dialogues.

摘要

大型语言模型（LLMs）在医疗保健系统的门诊转诊任务中应用日益广泛。然而，目前缺乏标准化评估标准来衡量其有效性，特别是在动态交互场景中。本研究系统考察了LLMs在智能门诊转诊（IOR）系统中管理任务的能力与局限，并提出专为此类系统设计的综合评估框架。该框架包含两项核心任务：静态评估着重衡量预定义门诊转诊的执行能力，动态评估则通过迭代对话测试转诊建议的优化能力。研究发现，相较于BERT类模型，LLMs优势有限，但在交互对话中展现出发问有效问题的潜力。

Communicating Activations Between Language Model Agents

Abstract

arXiv:2501.14082v2 Announce Type: replace-cross Abstract: Communication between multiple language model (LM) agents has been shown to scale up the reasoning ability of LMs. While natural language has been the dominant medium for inter-LM communication, it is not obvious this should be the standard: not only does natural language communication incur high inference costs that scale quickly with the number of both agents and messages, but also the decoding process abstracts away too much rich information that could be otherwise accessed from the internal activations. In this work, we propose a simple technique whereby LMs communicate via activations; concretely, we pause an LM B's computation at an intermediate layer, combine its current activation with another LM A's intermediate activation via some function f, then pass f's output into the next layer of B and continue the forward pass till decoding is complete. This approach scales up LMs on new tasks with zero additional parameters and data, and saves a substantial amount of compute over natural language communication. We test our method with various functional forms f on two experimental setups (multi-player coordination games and reasoning benchmarks) and find that it achieves up to 27.0% improvement over natural language communication across datasets with less than 1/4 the compute, illustrating the superiority and robustness of activations as an alternative "language" for communication between LMs.

摘要

多个语言模型（LM）智能体之间的通信已被证明能扩展模型的推理能力。尽管自然语言一直是LM间通信的主要媒介，但其作为标准媒介的合理性尚存疑问：自然语言通信不仅会带来随智能体数量和消息量快速攀升的高昂推理成本，而且解码过程会抽象掉过多本可从内部激活中获取的丰富信息。本研究提出一种通过激活值进行通信的简单技术：具体而言，我们在LM B 的中间层暂停其计算，将其当前激活与另一LM A 的中间激活通过函数 f 进行组合，再将 f 的输出传递至 B 的下一层并继续前向传播直至解码完成。该方法无需额外参数和数据即可扩展LM在新任务上的能力，相比自然语言通信可节省大量计算资源。我们在两种实验设置（多玩家协作游戏和推理基准测试）中，采用不同函数形式 f 验证本方法，发现其以不足1/4的计算量，在各数据集上相比自然语言通信最高可获得27.0%的性能提升，这证明了激活值作为LM间替代性"语言"的优越性与鲁棒性。
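The pause, combine, and resume procedure described above can be sketched on toy models. Everything below is an assumption made for illustration: the "layers" are elementwise affine maps with ReLU rather than transformer blocks, both models are cut at the same depth `k`, and the three candidate forms of `f` (sum, mean, replace) are not claimed to be the ones evaluated in the paper.

```python
def make_layer(weight, bias):
    """A toy 'layer': elementwise affine map followed by ReLU."""
    def layer(h):
        return [max(0.0, weight * x + bias) for x in h]
    return layer

def run_layers(layers, h):
    """Apply a sequence of layers to an activation vector."""
    for layer in layers:
        h = layer(h)
    return h

# Hypothetical choices for the combining function f(h_b, h_a).
def combine_sum(h_b, h_a):
    return [b + a for b, a in zip(h_b, h_a)]

def combine_mean(h_b, h_a):
    return [(b + a) / 2 for b, a in zip(h_b, h_a)]

def combine_replace(h_b, h_a):
    return list(h_a)

def communicate(layers_a, layers_b, x_a, x_b, k, f):
    """Pause B after its first k layers, merge in A's layer-k activation
    via f, then resume B's forward pass from layer k onward."""
    h_a = run_layers(layers_a[:k], x_a)
    h_b = run_layers(layers_b[:k], x_b)
    merged = f(h_b, h_a)
    return run_layers(layers_b[k:], merged)

layers_a = [make_layer(1.0, 0.0), make_layer(2.0, -1.0)]
layers_b = [make_layer(0.5, 1.0), make_layer(1.0, 0.0)]
out_sum = communicate(layers_a, layers_b, [1.0, 2.0], [2.0, 4.0], k=1, f=combine_sum)
out_mean = communicate(layers_a, layers_b, [1.0, 2.0], [2.0, 4.0], k=1, f=combine_mean)
```

Model A only runs its first `k` layers and never decodes a message, and B receives no extra tokens to process. The sketch does not measure the resulting compute difference; it only shows where the saving would come from.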

Faster, Cheaper, Better: Multi-Objective Hyperparameter Optimization for LLM and RAG Systems

Abstract

arXiv:2502.18635v2 Announce Type: replace-cross Abstract: While Retrieval Augmented Generation (RAG) has emerged as a popular technique for improving Large Language Model (LLM) systems, it introduces a large number of choices, parameters and hyperparameters that must be made or tuned. This includes the LLM, embedding, and ranker models themselves, as well as hyperparameters governing individual RAG components. Yet, collectively optimizing the entire configuration in a RAG or LLM system remains under-explored - especially in multi-objective settings - due to intractably large solution spaces, noisy objective evaluations, and the high cost of evaluations. In this work, we introduce the first approach for multi-objective parameter optimization of cost, latency, safety and alignment over entire LLM and RAG systems. We find that Bayesian optimization methods significantly outperform baseline approaches, obtaining a superior Pareto front on two new RAG benchmark tasks. We conclude our work with important considerations for practitioners who are designing multi-objective RAG systems, highlighting nuances such as how optimal configurations may not generalize across tasks and objectives.

摘要

尽管检索增强生成（RAG）已成为改进大型语言模型（LLM）系统的流行技术，但它引入了大量需要选择或调优的参数和超参数。这包括LLM、嵌入和排序模型本身，以及控制各个RAG组件的超参数。然而，由于解决方案空间庞大、目标评估存在噪声以及评估成本高昂，对整个RAG或LLM系统配置进行集体优化（尤其是在多目标场景下）仍未被充分探索。本研究首次提出了针对整个LLM和RAG系统在成本、延迟、安全性和对齐性方面的多目标参数优化方法。我们发现贝叶斯优化方法显著优于基线方法，在两个新的RAG基准任务上获得了更优的帕累托前沿。最后，我们为设计多目标RAG系统的实践者提出了重要考量，强调了诸如最优配置可能无法跨任务和目标泛化等细微差别。
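The notion of a Pareto front over competing objectives can be made concrete with a short sketch. The configuration names and scores below are invented, and only three objectives (cost, latency, error rate, all to be minimized) are used for brevity. This filters a fixed set of already-evaluated configurations; it does not perform the Bayesian optimization search itself.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(configs):
    """Return the names of configurations not dominated by any other."""
    front = []
    for name, scores in configs.items():
        dominated = any(
            dominates(other, scores)
            for other_name, other in configs.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical (cost, latency, error) triples for four RAG configurations.
configs = {
    "small-llm_top3": (0.10, 0.8, 0.30),
    "small-llm_top10": (0.15, 1.1, 0.25),
    "large-llm_top3": (0.90, 2.5, 0.12),
    "large-llm_top10": (1.10, 3.0, 0.12),
}
front = pareto_front(configs)
```

Here `large-llm_top10` drops out because `large-llm_top3` is cheaper and faster with the same error rate, while the remaining three each represent a different trade-off that no single number can rank.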

Safety Evaluation of DeepSeek Models in Chinese Contexts

Abstract

arXiv:2502.11137v3 Announce Type: replace-cross Abstract: Recently, the DeepSeek series of models, leveraging their exceptional reasoning capabilities and open-source strategy, is reshaping the global AI landscape. Despite these advantages, they exhibit significant safety deficiencies. Research conducted by Robust Intelligence, a subsidiary of Cisco, in collaboration with the University of Pennsylvania, revealed that DeepSeek-R1 has a 100% attack success rate when processing harmful prompts. Additionally, multiple safety companies and research institutions have confirmed critical safety vulnerabilities in this model. As models demonstrating robust performance in Chinese and English, DeepSeek models require equally crucial safety assessments in both language contexts. However, current research has predominantly focused on safety evaluations in English environments, leaving a gap in comprehensive assessments of their safety performance in Chinese contexts. In response to this gap, this study introduces CHiSafetyBench, a Chinese-specific safety evaluation benchmark. This benchmark systematically evaluates the safety of DeepSeek-R1 and DeepSeek-V3 in Chinese contexts, revealing their performance across safety categories. The experimental results quantify the deficiencies of these two models in Chinese contexts, providing key insights for subsequent improvements. It should be noted that, despite our efforts to establish a comprehensive, objective, and authoritative evaluation benchmark, the selection of test samples, characteristics of data distribution, and the setting of evaluation criteria may inevitably introduce certain biases into the evaluation results. We will continuously optimize the evaluation benchmark and periodically update this report to provide more comprehensive and accurate assessment outcomes. Please refer to the latest version of the paper for the most recent evaluation results and conclusions.

摘要

近期，DeepSeek系列模型凭借其卓越的推理能力和开源策略，正在重塑全球人工智能格局。尽管具备这些优势，该系列模型仍存在显著的安全缺陷。思科旗下Robust Intelligence与宾夕法尼亚大学联合研究表明，DeepSeek-R1在处理有害提示时攻击成功率高达100%。此外，多家安全公司与研究机构均证实该模型存在重大安全漏洞。作为在中英文语境下均展现强劲性能的模型，DeepSeek系列需要同等重要的双语安全评估。然而现有研究主要集中于英语环境的安全评测，对其在中文语境下的安全性能缺乏系统评估。针对这一空白，本研究提出中文专属安全评估基准CHiSafetyBench，系统评估了DeepSeek-R1与DeepSeek-V3在中文语境下的安全性，揭示其在各安全维度的表现。实验结果量化了两个模型在中文语境下的缺陷，为后续改进提供了关键依据。需要说明的是，尽管我们致力于建立全面、客观且权威的评估基准，但测试样本的选择、数据分布特征及评价标准的设定仍可能使评估结果存在一定偏差。我们将持续优化评估基准并定期更新本报告，以提供更全面准确的评估结果。最新评估数据与结论请以论文最新版本为准。

CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation

Abstract

arXiv:2503.22688v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have demonstrated exceptional performance in code generation tasks and have become indispensable programming assistants for developers. However, existing code generation benchmarks primarily assess the functional correctness of code generated by LLMs in single-turn interactions, offering limited insight into their capabilities to generate code that strictly follows users' instructions, especially in multi-turn interaction scenarios. In this paper, we introduce CodeIF-Bench, a benchmark for evaluating LLMs' instruction-following capabilities in interactive code generation. Specifically, CodeIF-Bench incorporates nine types of verifiable instructions aligned with the real-world software development requirements, which can be independently and objectively validated through specified test cases, facilitating the evaluation of instruction-following capability in multi-turn interactions. We evaluate nine prominent LLMs using CodeIF-Bench, and the experimental results reveal a significant disparity between their basic programming capability and instruction-following capability, particularly as task complexity, context length, and the number of dialogue rounds increase.

摘要

大语言模型（LLMs）在代码生成任务中展现出卓越性能，已成为开发者不可或缺的编程助手。然而，现有代码生成基准主要评估LLMs在单轮交互中生成代码的功能正确性，对其严格遵循用户指令生成代码的能力（尤其在多轮交互场景中）的洞察有限。本文提出CodeIF-Bench，一个用于评估LLMs在交互式代码生成中指令遵循能力的基准。具体而言，CodeIF-Bench整合了九类符合真实软件开发需求的可验证指令，这些指令可通过指定测试用例独立客观地验证，从而支持多轮交互中指令遵循能力的评估。我们对九种主流LLMs进行CodeIF-Bench测试，实验结果表明其基础编程能力与指令遵循能力存在显著差距，且该差距随任务复杂度、上下文长度及对话轮次增加而扩大。

Optimizing LLMs for Resource-Constrained Environments: A Survey of Model Compression Techniques

Abstract

arXiv:2505.02309v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have revolutionized many areas of artificial intelligence (AI), but their substantial resource requirements limit their deployment on mobile and edge devices. This survey paper provides a comprehensive overview of techniques for compressing LLMs to enable efficient inference in resource-constrained environments. We examine three primary approaches: Knowledge Distillation, Model Quantization, and Model Pruning. For each technique, we discuss the underlying principles, present different variants, and provide examples of successful applications. We also briefly discuss complementary techniques such as mixture-of-experts and early-exit strategies. Finally, we highlight promising future directions, aiming to provide a valuable resource for both researchers and practitioners seeking to optimize LLMs for edge deployment.

摘要

大语言模型（LLMs）已经彻底改变了人工智能（AI）的许多领域，但其巨大的资源需求限制了它们在移动和边缘设备上的部署。本综述论文全面概述了压缩LLMs的技术，以实现在资源受限环境中的高效推理。我们研究了三种主要方法：知识蒸馏、模型量化和模型剪枝。针对每种技术，我们讨论了其基本原理，介绍了不同的变体，并提供了成功应用的示例。我们还简要讨论了混合专家和早期退出策略等补充技术。最后，我们指出了未来有前景的研究方向，旨在为研究人员和实践者提供一个有价值的资源，帮助他们优化LLMs以实现边缘部署。

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Abstract

arXiv:2504.08837v3 Announce Type: replace-cross Abstract: Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow-thinking, we introduce Forced Rethinking, which appends a rethinking trigger token to the end of rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances state-of-the-art scores on MathVista and MathVerse to 80.4% and 63.5%, respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MathVision, MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with OpenAI-o1. Our empirical results show the effectiveness of our approaches.

摘要

近期，GPT-o1和DeepSeek-R1等慢思考系统通过显式反思在解决复杂问题方面展现出巨大潜力。它们在各类数学与科学基准测试中显著优于GPT-4o等最佳快思考模型，但其多模态推理能力仍与快思考模型相当。例如，GPT-o1在MathVista、MathVerse和MathVision等基准上的表现与快思考模型相似。本文旨在通过强化学习（不依赖蒸馏技术）增强视觉语言模型的慢思考能力，以推动技术前沿。首先，我们采用GRPO算法结合选择性样本重放（SSR）新技术解决优势消失问题。虽然该方法表现出色，但所得RL训练模型的自反思或自验证能力有限。为进一步促进慢思考，我们提出强制再思考机制，在RL训练轨迹末端添加再思考触发标记，显式强制执行自反思推理步骤。通过结合这两种技术，我们的VL-Rethinker模型将MathVista和MathVerse的先进水平分别提升至80.4%和63.5%。该模型还在MathVision、MMMU-Pro、EMMA和MEGA-Bench等多学科基准测试中取得开源领域最优成绩，缩小了与OpenAI-o1的差距。实证结果验证了我们方法的有效性。
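The two ingredients named above can be sketched under loud assumptions. The abstract does not spell out the mechanics, so this follows the common formulation of GRPO in which advantages are rewards standardized within a group of rollouts for the same prompt; the replay selection rule, the threshold, and the trigger string are all hypothetical stand-ins rather than the paper's actual SSR procedure or trigger token.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one prompt's
    group of rollouts. If every rollout gets the same reward, all
    advantages are zero and the group contributes no learning signal."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def select_for_replay(rollouts, rewards, threshold=1e-3):
    """Keep only rollouts with a non-negligible advantage
    (an assumed selection rule, for illustration only)."""
    advantages = group_advantages(rewards)
    return [(r, a) for r, a in zip(rollouts, advantages) if abs(a) > threshold]

def force_rethink(rollout, trigger=" Wait, let me re-check the reasoning."):
    """Append a rethinking trigger so generation continues with a
    self-reflection step (the trigger text here is made up)."""
    return rollout + trigger

uniform = group_advantages([1.0, 1.0, 1.0, 1.0])
kept = select_for_replay(["r1", "r2", "r3", "r4"], [1.0, 0.0, 0.0, 1.0])
extended = force_rethink("The answer is 12.")
```

When all four rollouts earn the same reward, `uniform` is all zeros, which illustrates the vanishing-advantages problem. In the mixed-reward group every rollout carries a signal and is kept.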

Benchmarking Open-Source Large Language Models on Healthcare Text Classification Tasks

Abstract

arXiv:2503.15169v2 Announce Type: replace-cross Abstract: The application of large language models (LLMs) to healthcare information extraction has emerged as a promising approach. This study evaluates the classification performance of five open-source LLMs: GEMMA-3-27B-IT, LLAMA3-70B, LLAMA4-109B, DEEPSEEK-R1-DISTILL-LLAMA-70B, and DEEPSEEK-V3-0324-UD-Q2_K_XL, across six healthcare-related classification tasks involving both social media data (breast cancer, changes in medication regimen, adverse pregnancy outcomes, potential COVID-19 cases) and clinical data (stigma labeling, medication change discussion). We report precision, recall, and F1 scores with 95% confidence intervals for all model-task combinations. Our findings reveal significant performance variability between LLMs, with DeepSeekV3 emerging as the strongest overall performer, achieving the highest F1 scores in four tasks. Notably, models generally performed better on social media tasks than on clinical data tasks, suggesting potential domain-specific challenges. GEMMA-3-27B-IT demonstrated exceptionally high recall despite its smaller parameter count, while LLAMA4-109B showed surprisingly underwhelming performance compared to its predecessor LLAMA3-70B, indicating that larger parameter counts do not guarantee improved classification results. We observed distinct precision-recall trade-offs across models, with some favoring sensitivity over specificity and vice versa. These findings highlight the importance of task-specific model selection for healthcare applications, considering the particular data domain and precision-recall requirements rather than model size alone. As healthcare increasingly integrates AI-driven text classification tools, this comprehensive benchmarking provides valuable guidance for model selection and implementation while underscoring the need for continued evaluation and domain adaptation of LLMs in healthcare contexts.

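Summary

Applying large language models (LLMs) to healthcare information extraction has become a promising research direction. This study evaluates five open-source LLMs (GEMMA-3-27B-IT, LLAMA3-70B, LLAMA4-109B, DEEPSEEK-R1-DISTILL-LLAMA-70B, and DEEPSEEK-V3-0324-UD-Q2_K_XL) on six healthcare classification tasks covering social media data (breast cancer, changes in medication regimen, adverse pregnancy outcomes, potential COVID-19 cases) and clinical data (stigma labeling, medication change discussion). We report precision, recall, and F1 scores with 95% confidence intervals for every model-task combination. The results show significant performance differences between LLMs, with DeepSeekV3 the strongest overall, achieving the highest F1 score on four tasks. Notably, models generally performed better on social media tasks than on clinical data tasks, which suggests domain-specific challenges. GEMMA-3-27B-IT showed very high recall despite its smaller parameter count, while LLAMA4-109B performed unexpectedly poorly compared with its predecessor LLAMA3-70B, indicating that a larger parameter count does not guarantee better classification results. We observed clear precision-recall trade-offs across models, with some favoring sensitivity over specificity and others the reverse. These findings highlight the importance of task-specific model selection in healthcare applications, taking into account the data domain and precision-recall requirements rather than model size alone. As healthcare increasingly adopts AI-driven text classification tools, this benchmark offers guidance for model selection and implementation, while underscoring the need for continued evaluation and domain adaptation of LLMs in healthcare settings.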
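
As a reminder of what the reported metrics measure, here is a minimal sketch of precision, recall, and F1 for a binary task, together with a 95% confidence interval for F1. The abstract does not say how the intervals were computed; the percentile bootstrap used here is an assumption, and the labels, function names, and resample count are illustrative only.

```python
import random


def prf1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def bootstrap_f1_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1 (assumed method, not from the paper)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Resample examples with replacement and recompute F1.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(prf1([y_true[i] for i in idx], [y_pred[i] for i in idx])[2])
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy labels: 4 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]

precision, recall, f1 = prf1(y_true, y_pred)
lower, upper = bootstrap_f1_ci(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"95% CI for F1: [{lower:.2f}, {upper:.2f}]")
```

A model that favors sensitivity raises recall at the cost of more false positives (lower precision), which is the trade-off the abstract describes across models.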

RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

Abstract

arXiv:2505.03005v2 Announce Type: replace-cross Abstract: We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today's prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models which are also governed by the Qwen License Agreement. Models: https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102. Training code: https://github.com/recursal/RADLADS-paper

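Summary

We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention Transformer models into linear attention decoder models. We also introduce two new RWKV-variant architectures, along with models converted from the open-source Qwen2.5 models at 7B, 32B, and 72B parameters. The conversion process needs only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to the 72B linear attention model costs less than $2,000 USD at current prices, while inference quality stays close to that of the original Transformer. Among linear attention models of their size, these models achieve state-of-the-art downstream performance on a set of standard benchmarks. All models are released on HuggingFace under the Apache 2.0 license (the 72B models are additionally governed by the Qwen License Agreement).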
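
The token-budget claim can be checked with simple arithmetic. The sketch below only restates the abstract's numbers: if 350-700M distillation tokens are less than 0.005% of the teacher's training tokens, the teacher must have seen more than 14T tokens. The 18T figure used in the last step is an assumption taken from outside this abstract (the pretraining scale commonly reported for Qwen2.5), and the function names are hypothetical.

```python
def implied_min_teacher_tokens(distill_tokens, max_fraction):
    """If distill_tokens < max_fraction * teacher_tokens,
    then teacher_tokens > distill_tokens / max_fraction."""
    return distill_tokens / max_fraction


def percent_of_teacher(distill_tokens, teacher_tokens):
    """Distillation budget as a percentage of the teacher's training tokens."""
    return 100.0 * distill_tokens / teacher_tokens


MAX_FRACTION = 0.005 / 100  # "less than 0.005%" from the abstract

for distill_tokens in (350e6, 700e6):
    bound = implied_min_teacher_tokens(distill_tokens, MAX_FRACTION)
    print(f"{distill_tokens / 1e6:.0f}M tokens -> teacher corpus above {bound / 1e12:.1f}T tokens")

# Assumed teacher scale, not stated in the abstract.
ASSUMED_TEACHER_TOKENS = 18e12
print(f"{percent_of_teacher(700e6, ASSUMED_TEACHER_TOKENS):.4f}% of the assumed teacher budget")
```

Under that assumed 18T-token teacher budget, even the upper end of the range (700M tokens) stays below the 0.005% threshold quoted in the abstract.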

